Mapping the Challenges of HCI: An Application and Evaluation of ChatGPT and GPT-4 for Mining Insights at Scale (2306.05036v4)

Published 8 Jun 2023 in cs.HC and cs.AI

Abstract: Large language models (LLMs), such as ChatGPT and GPT-4, are gaining widespread real-world use. Yet these LLMs are closed-source, and little is known about their performance in real-world use cases. In this paper, we apply and evaluate the combination of ChatGPT and GPT-4 on the real-world task of mining insights from a text corpus in order to identify research challenges in the field of HCI. We extract 4,392 research challenges across more than 100 topics from the 2023 CHI conference proceedings and visualize the research challenges for interactive exploration. We critically evaluate the LLMs on this practical task and conclude that the combination of ChatGPT and GPT-4 is an excellent, cost-efficient means of analyzing a text corpus at scale. Cost-efficiency is key for flexibly prototyping research ideas and analyzing text corpora from different perspectives, with implications for applying LLMs to mining insights in academia and practice.
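
The abstract does not spell out the pipeline, but the extraction step it describes maps naturally onto a per-paper chat-completion call. Below is a minimal sketch using the OpenAI Python SDK; the prompt wording, the model choice (gpt-3.5-turbo as a stand-in for "ChatGPT"), and the line-based parsing are illustrative assumptions, not the paper's actual prompts or method.

```python
# Minimal sketch of LLM-based insight mining over a corpus of papers.
# Assumptions: prompt text, model name, and output parsing are hypothetical;
# the paper's exact pipeline is not reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_challenges(paper_text: str) -> list[str]:
    """Ask a chat model to list research challenges raised in one paper."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for ChatGPT; GPT-4 could verify outputs
        temperature=0,          # keep extraction as repeatable as possible
        messages=[
            {"role": "system",
             "content": "You extract research challenges from HCI papers."},
            {"role": "user",
             "content": "List the research challenges raised in the following "
                        "text, one per line:\n\n" + paper_text},
        ],
    )
    text = response.choices[0].message.content or ""
    # Strip bullet markers and blank lines from the model's free-text reply.
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]
```

Running such a function over each paper in the proceedings yields the per-paper challenge lists that could then be embedded, clustered into topics, and visualized for interactive exploration, as the abstract describes.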
