
Questionnaires for Everyone: Streamlining Cross-Cultural Questionnaire Adaptation with GPT-Based Translation Quality Evaluation (2407.20608v1)

Published 30 Jul 2024 in cs.HC and cs.CL

Abstract: Adapting questionnaires to new languages is a resource-intensive process often requiring the hiring of multiple independent translators, which limits the ability of researchers to conduct cross-cultural research and effectively creates inequalities in research and society. This work presents a prototype tool that can expedite the questionnaire translation process. The tool incorporates forward-backward translation using DeepL alongside GPT-4-generated translation quality evaluations and improvement suggestions. We conducted two online studies in which participants translated questionnaires from English to either German (Study 1; n=10) or Portuguese (Study 2; n=20) using our prototype. To evaluate the quality of the translations created using the tool, evaluation scores between conventionally translated and tool-supported versions were compared. Our results indicate that integrating LLM-generated translation quality evaluations and suggestions for improvement can help users independently attain results similar to those provided by conventional, non-NLP-supported translation methods. This is the first step towards more equitable questionnaire-based research, powered by AI.
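The forward-backward loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `round_trip`, `ItemResult`, `toy_translate`, and `toy_evaluate` are hypothetical names, the toy translator stands in for a machine translation service such as DeepL, and the toy evaluator (simple word overlap) stands in for an LLM-generated quality evaluation such as the GPT-4 one used in the prototype. The abstract does not specify prompts, scoring scales, or which texts are compared, so those details are assumptions here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interfaces: a translator maps (text, source_lang, target_lang) -> text,
# an evaluator maps (original, back_translation) -> (score, suggestion).
Translator = Callable[[str, str, str], str]
Evaluator = Callable[[str, str], Tuple[float, str]]


@dataclass
class ItemResult:
    source: str      # original questionnaire item
    forward: str     # translation into the target language
    backward: str    # translation of `forward` back into the source language
    score: float     # quality score returned by the evaluator
    suggestion: str  # improvement hint returned by the evaluator


def round_trip(items: List[str], translate: Translator, evaluate: Evaluator,
               src: str = "EN", tgt: str = "DE") -> List[ItemResult]:
    """Forward-translate each item, back-translate it, then evaluate the pair."""
    results = []
    for item in items:
        forward = translate(item, src, tgt)
        backward = translate(forward, tgt, src)
        score, suggestion = evaluate(item, backward)
        results.append(ItemResult(item, forward, backward, score, suggestion))
    return results


def toy_translate(text: str, src: str, tgt: str) -> str:
    # Assumption: a lookup table replaces a real translation service.
    table = {
        ("EN", "DE"): {"I am reserved": "Ich bin reserviert"},
        ("DE", "EN"): {"Ich bin reserviert": "I am reserved"},
    }
    return table.get((src, tgt), {}).get(text, text)


def toy_evaluate(original: str, back: str) -> Tuple[float, str]:
    # Assumption: word overlap (Jaccard) replaces an LLM-based evaluation.
    a, b = set(original.lower().split()), set(back.lower().split())
    score = len(a & b) / len(a | b) if (a | b) else 1.0
    suggestion = "" if score == 1.0 else "Back-translation diverges; review wording."
    return score, suggestion


results = round_trip(["I am reserved"], toy_translate, toy_evaluate)
```

In a real setup the two callables would wrap calls to a translation service and an LLM; keeping them as injected functions lets the loop be exercised without network access.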

Authors: Otso Haavisto, Robin Welsch