Constructing Cross-lingual Consumer Health Vocabulary with Word-Embedding from Comparable User Generated Content (2206.11612v2)
Abstract: The online health community (OHC) is the primary channel for laypeople to share health information. A critical challenge in analyzing the health consumer-generated content (HCGC) from OHCs is identifying the colloquial medical expressions that laypeople use. The open-access and collaborative consumer health vocabulary (OAC CHV) is the controlled vocabulary for addressing this challenge. However, OAC CHV is available only in English, which limits its applicability to other languages. This research proposes a cross-lingual automatic term recognition framework that extends the English CHV into a cross-lingual one. The framework takes an English HCGC corpus and a non-English HCGC corpus (Chinese in this study) as inputs. It trains two monolingual word vector spaces with the skip-gram algorithm, so that each space encodes the common word associations of laypeople within one language. Based on the isometry assumption, that is, that the two spaces share a similar geometric structure, the framework aligns the two monolingual spaces into a bilingual word vector space, in which cosine similarity serves as the metric for identifying semantically similar words across languages. The experimental results demonstrate that our framework outperforms two large language models (LLMs) in identifying CHV across languages. The framework requires only raw HCGC corpora and a small set of medical translations, reducing the human effort needed to compile a cross-lingual CHV.
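The alignment and retrieval steps described in the abstract can be illustrated with a minimal sketch. The abstract does not state which alignment procedure is used, so this example assumes orthogonal Procrustes, one common way to map between two spaces under the isometry assumption. Random toy vectors stand in for the skip-gram embeddings, the "Chinese" space is constructed as an exact rotation of the "English" one, and the first 20 word pairs play the role of the seed medical translations. The function names (`fit_orthogonal_map`, `nearest_by_cosine`) are hypothetical and not taken from the paper.

```python
import numpy as np


def normalize_rows(m):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.clip(norms, 1e-12, None)


def fit_orthogonal_map(src_seed, tgt_seed):
    """Orthogonal Procrustes: find orthogonal W minimizing ||src_seed @ W - tgt_seed||_F.

    Rows of src_seed and tgt_seed are embeddings of seed translation pairs.
    """
    u, _, vt = np.linalg.svd(src_seed.T @ tgt_seed)
    return u @ vt


def nearest_by_cosine(query_vecs, candidate_vecs, k=1):
    """Return indices of the k most cosine-similar candidates for each query."""
    sims = normalize_rows(query_vecs) @ normalize_rows(candidate_vecs).T
    return np.argsort(-sims, axis=1)[:, :k]


# Toy data (assumption): random vectors in place of trained skip-gram embeddings.
rng = np.random.default_rng(0)
dim, vocab, n_seed = 8, 50, 20
en_vecs = rng.normal(size=(vocab, dim))

# Build a target space that is an exact rotation of the source space,
# i.e. the isometry assumption holds perfectly in this toy setting.
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
zh_vecs = en_vecs @ true_rotation

# Fit the map on the seed pairs only, then apply it to the whole vocabulary.
seed = np.arange(n_seed)
w = fit_orthogonal_map(en_vecs[seed], zh_vecs[seed])
top1 = nearest_by_cosine(en_vecs @ w, zh_vecs)[:, 0]

# Fraction of source words whose nearest target word is the true counterpart.
accuracy = float(np.mean(top1 == np.arange(vocab)))
print("top-1 retrieval accuracy:", accuracy)
```

On real corpora the two spaces are only approximately isometric, so the retrieved neighbours would be candidate translations for review rather than guaranteed matches.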
- Chia-Hsuan Chang
- Lei Wang
- Christopher C. Yang