Mapping the Challenges of HCI: An Application and Evaluation of ChatGPT and GPT-4 for Mining Insights at Scale (2306.05036v4)
Abstract: LLMs such as ChatGPT and GPT-4 are gaining widespread real-world use. Yet these LLMs are closed source, and little is known about their performance in real-world use cases. In this paper, we apply and evaluate the combination of ChatGPT and GPT-4 on the real-world task of mining insights from a text corpus in order to identify research challenges in the field of HCI. We extract 4,392 research challenges in over 100 topics from the 2023 CHI conference proceedings and visualize the research challenges for interactive exploration. We critically evaluate the LLMs on this practical task and conclude that the combination of ChatGPT and GPT-4 is an excellent, cost-efficient means of analyzing a text corpus at scale. Cost efficiency is key for flexibly prototyping research ideas and analyzing text corpora from different perspectives, with implications for applying LLMs to mining insights in academia and practice.
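The abstract describes prompting ChatGPT to mine research challenges from each paper in the corpus. As a hedged illustration only — the authors' actual prompt, model parameters, and output format are not given here — the sketch below builds the JSON body for one such chat-completions extraction request using only the Python standard library. The prompt wording, the `gpt-3.5-turbo` model name, and the JSON-array output convention are assumptions for illustration.

```python
import json

# Endpoint for OpenAI's chat completions API; the request below is a
# hypothetical reconstruction of a per-paper extraction call.
API_URL = "https://api.openai.com/v1/chat/completions"

def build_extraction_request(paper_text: str, model: str = "gpt-3.5-turbo") -> dict:
    """Build the JSON body for a chat-completions call that asks the model
    to extract research challenges as a JSON array of short statements."""
    system = (
        "You extract research challenges from HCI papers. "
        "Return a JSON array of concise challenge statements."
    )
    return {
        "model": model,
        "temperature": 0,  # deterministic output aids reproducible corpus mining
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": paper_text},
        ],
    }

req = build_extraction_request("Example paper text ...")
body = json.dumps(req)  # POST this to API_URL with an Authorization header
```

In a corpus-scale run, a loop would issue one such request per paper and collect the returned challenge statements for downstream clustering and visualization.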
- Stability AI. 2023. Meet Stable Beluga 1 and Stable Beluga 2, Our Large and Mighty Instruction Fine-Tuned Language Models. https://stability.ai/blog/stable-beluga-large-instruction-fine-tuned-models
- Falcon-40B: An open large language model with state-of-the-art performance. https://huggingface.co/tiiuae/falcon-40b
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, Ann Arbor, Michigan, 65–72. https://www.aclweb.org/anthology/W05-0909
- Open Information Extraction from the Web. In Proceedings of the 20th International Joint Conference on Artifical Intelligence (IJCAI ’07). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2670–2676.
- On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (Virtual Event, Canada) (FAccT ’21). Association for Computing Machinery, New York, NY, USA, 610–623. https://doi.org/10.1145/3442188.3445922
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. arXiv:2304.01373 [cs.CL]
- On the Opportunities and Risks of Foundation Models. ArXiv (2021), 214 pages. https://crfm.stanford.edu/assets/report.pdf
- Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101. https://doi.org/10.1191/1478088706qp063oa
- Virginia Braun and Victoria Clarke. 2019. Reflecting on reflexive thematic analysis. Qualitative Research in Sport, Exercise and Health 11, 4 (2019), 589–597. https://doi.org/10.1080/2159676X.2019.1628806
- Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901.
- Sparks of Artificial General Intelligence: Early experiments with GPT-4. https://doi.org/10.48550/arXiv.2303.12712
- Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 291–305. https://doi.org/10.18653/v1/2022.emnlp-main.20
- Extracting Training Data from Large Language Models. In Proceedings of the 30th USENIX Security Symposium. USENIX Association, 2633–2650.
- Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc.
- Andy Coenen and Adam Pearce. [n. d.]. Understanding UMAP. https://pair-code.github.io/understanding-umap/
- Jacob Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20, 1 (1960), 37–46. https://doi.org/10.1177/001316446002000104
- Cohere Team. 2022. LLM Parameters Demystified: Getting The Best Outputs from Language AI. https://txt.cohere.com/llm-parameters-best-outputs-language-ai/
- Antonia Creswell and Murray Shanahan. 2022. Faithful Reasoning Using Large Language Models. https://doi.org/10.48550/arXiv.2208.14271
- Chris Cundy and Stefano Ermon. 2023. SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking. arXiv:2306.05426
- Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers. https://doi.org/10.48550/arXiv.2212.10559
- Luciano Del Corro and Rainer Gemulla. 2013. ClausIE: Clause-Based Open Information Extraction. In Proceedings of the 22nd International Conference on World Wide Web (WWW ’13). Association for Computing Machinery, New York, NY, USA, 355–366. https://doi.org/10.1145/2488388.2488420
- Peter J. Denning. 2023. Can Generative AI Bots Be Trusted? Commun. ACM 66, 6 (may 2023), 24–27. https://doi.org/10.1145/3592981
- Nicole M. Deterding and Mary C. Waters. 2021. Flexible Coding of In-depth Interviews: A Twenty-first-century Approach. Sociological Methods & Research 50, 2 (2021), 708–739. https://doi.org/10.1177/0049124118799377
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
- Ravit Dotan and Smitha Milli. 2020. Value-Laden Disciplinary Shifts in Machine Learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* ’20). Association for Computing Machinery, New York, NY, USA, 294. https://doi.org/10.1145/3351095.3373157
- Viewpoint Diversity in Search Results. In Advances in Information Retrieval, Jaap Kamps, Lorraine Goeuriot, Fabio Crestani, Maria Maistro, Hideo Joho, Brian Davis, Cathal Gurrin, Udo Kruschwitz, and Annalina Caputo (Eds.). Springer Nature Switzerland, Cham, 279–297.
- AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. https://tatsu-lab.github.io/alpaca_farm_paper.pdf
- Subhabrata Dutta and Tanmoy Chakraborty. 2023. Thus Spake ChatGPT. Commun. ACM 66, 12 (2023), 16–19. https://doi.org/10.1145/3616863
- The KDD Process for Extracting Useful Knowledge from Volumes of Data. Commun. ACM 39, 11 (nov 1996), 27–34. https://doi.org/10.1145/240455.240464
- CollabCoder: A GPT-Powered Workflow for Collaborative Qualitative Analysis. https://doi.org/10.48550/arXiv.2304.07366
- Felt Ethics: Cultivating Ethical Sensibility in Design Practice. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 1, 15 pages. https://doi.org/10.1145/3544548.3580875
- How HCI Bridges Health and Design in Online Health Communities: A Systematic Review. In Proceedings of the 2021 ACM Designing Interactive Systems Conference (DIS ’21). Association for Computing Machinery, New York, NY, USA, 970–983. https://doi.org/10.1145/3461778.3462100
- Barney G. Glaser and Anselm L. Strauss. 1967. The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine de Gruyter, New York, NY.
- Think before you speak: Training Language Models With Pause Tokens. arXiv:2310.02226
- Maarten Grootendorst. 2020. c-TF-IDF. https://github.com/MaartenGr/cTFIDF
- Maarten Grootendorst. 2022. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794
- Som Gupta and S. K Gupta. 2019. Abstractive summarization: An overview of the state of the art. Expert Systems with Applications 121 (2019), 49–65. https://doi.org/10.1016/j.eswa.2018.12.011
- Marking Material Interactions with Computer Vision. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 478, 17 pages. https://doi.org/10.1145/3544548.3580643
- Evaluating Large Language Models in Generating Synthetic HCI Research Data: A Case Study. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 433, 19 pages. https://doi.org/10.1145/3544548.3580688
- Rethinking with Retrieval: Faithful Large Language Model Inference. https://doi.org/10.48550/arXiv.2301.00303
- Less is Not More: Improving Findability and Actionability of Privacy Controls for Online Behavioral Advertising. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 661, 33 pages. https://doi.org/10.1145/3544548.3580773
- Co-Writing with Opinionated Language Models Affects Users’ Views. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 111, 15 pages. https://doi.org/10.1145/3544548.3581196
- Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 55, 12, Article 248 (mar 2023), 38 pages. https://doi.org/10.1145/3571730
- Understanding the Benefits and Challenges of Deploying Conversational AI Leveraging Large Language Models for Public Health Intervention. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 18, 16 pages. https://doi.org/10.1145/3544548.3581503
- The Ghost in the Machine has an American accent: value conflict in GPT-3. https://doi.org/10.48550/arXiv.2203.07785
- Autospeculation: Reflecting on the Intimate and Imaginative Capacities of Data Analysis. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 151, 10 pages. https://doi.org/10.1145/3544548.3580902
- Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
- Large Language Models are Zero-Shot Reasoners. In ICML 2022 Workshop on Knowledge Retrieval and Language Models (ICML 2022). https://openreview.net/forum?id=6p3AuaHAFiN
- Can Pretrained Language Models Generate Persuasive, Faithful, and Informative Ad Text for Product Descriptions?. In Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5). Association for Computational Linguistics, Dublin, Ireland, 234–243. https://doi.org/10.18653/v1/2022.ecnlp-1.27
- Rachel L. Thomas and David Uminsky. 2022. Reliance on metrics is a fundamental challenge for AI. Patterns (N.Y.) 3, 5 (2022), 8 pages. https://doi.org/10.1016/j.patter.2022.100476
- Can language models learn from explanations in context?. In Findings of the Association for Computational Linguistics (EMNLP 2022). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 537–563.
- Evaluating ChatGPT’s Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness. https://doi.org/10.48550/arXiv.2304.11633
- AlpacaEval: An Automatic Evaluator of Instruction-following Models. https://github.com/tatsu-lab/alpaca_eval
- Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013
- Decoding Prompt Syntax: Analysing Its Impact on Knowledge Retrieval in Large Language Models. In Companion Proceedings of the ACM Web Conference 2023 (Austin, TX, USA) (WWW ’23 Companion). Association for Computing Machinery, New York, NY, USA, 1145–1149. https://doi.org/10.1145/3543873.3587655
- CHI 1994-2013: Mapping Two Decades of Intellectual Progress through Co-Word Analysis. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Toronto, Ontario, Canada) (CHI ’14). Association for Computing Machinery, New York, NY, USA, 3553–3562. https://doi.org/10.1145/2556288.2556969
- Reference-free Summarization Evaluation via Semantic Correlation and Compression Ratio. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Seattle, United States, 2109–2115. https://doi.org/10.18653/v1/2022.naacl-main.153
- Alejandro Lopez-Lira and Yuehua Tang. 2023. Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models. (6 Apr 2023).
- Mausam. 2016. Open Information Extraction Systems and Downstream Applications. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI ’16). AAAI Press, 4074–4077.
- Open Language Learning for Information Extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, Jeju Island, Korea, 523–534. https://aclanthology.org/D12-1048
- On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 1906–1919. https://doi.org/10.18653/v1/2020.acl-main.173
- Reliability and Inter-Rater Reliability in Qualitative Research: Norms and Guidelines for CSCW and HCI Practice. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 72 (nov 2019), 23 pages. https://doi.org/10.1145/3359174
- HDBSCAN: Hierarchical density based clustering. Journal of Open Source Software 2, 11 (2017), 205.
- UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software 3, 29 (2018), 861. https://doi.org/10.21105/joss.00861
- Yohei Nakajima. 2023. babyagi. https://github.com/yoheinakajima/babyagi
- Ikujiro Nonaka. 1994. A Dynamic Theory of Organizational Knowledge Creation. Organization Science 5 (1994), 14–37. https://api.semanticscholar.org/CorpusID:17219859
- In-context Learning and Induction Heads. https://doi.org/10.48550/arXiv.2209.11895
- OpenAI. 2022. Introducing ChatGPT. https://openai.com/blog/chatgpt
- OpenAI. 2023. GPT-4 Technical Report. https://doi.org/10.48550/arXiv.2303.08774
- OpenAI. [n. d.]. Text completion. https://platform.openai.com/docs/guides/completion/prompt-design
- Jonas Oppenlaender. 2022. The Creativity of Text-to-Image Generation. In 25th International Academic Mindtrek Conference (Tampere, Finland) (Academic Mindtrek 2022). Association for Computing Machinery, New York, NY, USA, 192–202. https://doi.org/10.1145/3569219.3569352
- Jonas Oppenlaender. 2023. A Taxonomy of Prompt Modifiers for Text-To-Image Generation. Behaviour & Information Technology (2023), 1–14. https://doi.org/10.1080/0144929X.2023.2286532
- Jonas Oppenlaender and Simo Hosio. 2019. Design Recommendations for Augmenting Creative Tasks with Computational Priming. In Proceedings of the 18th International Conference on Mobile and Ubiquitous Multimedia (MUM ’19). ACM, New York, NY, USA, Article 35, 13 pages. https://doi.org/10.1145/3365610.3365621
- Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (Eds.). MIT Press.
- BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL ’02). Association for Computational Linguistics, USA, 311–318. https://doi.org/10.3115/1073083.1073135
- Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442 [cs.HC]
- Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns 2, 11 (2021), 100336. https://doi.org/10.1016/j.patter.2021.100336
- Discovering Language Model Behaviors with Model-Written Evaluations. In Findings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 13387–13434. https://doi.org/10.18653/v1/2023.findings-acl.847
- Fábio Perez and Ian Ribeiro. 2022. Ignore Previous Prompt: Attack Techniques For Language Models. arXiv:2211.09527
- AngleKindling: Supporting Journalistic Angle Ideation with Large Language Models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 225, 16 pages. https://doi.org/10.1145/3544548.3580907
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver? https://doi.org/10.48550/arXiv.2302.06476
- Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 8748–8763.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 21, 1, Article 140 (jan 2020), 67 pages.
- Supporting Human-AI Collaboration in Auditing LLMs with LLMs. arXiv:2304.09991
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3982–3992. https://doi.org/10.18653/v1/D19-1410
- Toran Bruce Richards. 2023. Auto-GPT: An experimental open-source attempt to make GPT-4 fully autonomous. https://github.com/Significant-Gravitas/Auto-GPT
- In-Context Impersonation Reveals Large Language Models’ Strengths and Biases. https://doi.org/10.48550/arXiv.2305.14930
- Towards Understanding Sycophancy in Language Models. arXiv:2310.13548
- Societal Biases in Language Generation: Progress and Challenges. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computational Linguistics, Online, 4275–4293. https://doi.org/10.18653/v1/2021.acl-long.330
- Jessica Shieh. 2023. Best practices for prompt engineering with OpenAI API. https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
- Grand Challenges for HCI Researchers. Interactions 23, 5 (aug 2016), 24–25. https://doi.org/10.1145/2977645
- Prompting GPT-3 To Be Reliable. In The 11th International Conference on Learning Representations (ICLR ’23).
- Philip L. Smith and Daniel R. Little. 2018. Small is beautiful: In defense of the small-N design. Psychonomic Bulletin & Review 25 (2018), 2083–2101. https://doi.org/10.3758/s13423-018-1451-8
- DeepLens: Interactive Out-of-Distribution Data Detection in NLP Models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 739, 17 pages. https://doi.org/10.1145/3544548.3580741
- Stability AI. 2023. StableLM: Stability AI Language Models. https://github.com/Stability-AI/StableLM
- Literature Reviews in HCI: A Review of Reviews. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 509, 24 pages. https://doi.org/10.1145/3544548.3581332
- Seven HCI Grand Challenges. International Journal of Human–Computer Interaction 35, 14 (2019), 1229–1269. https://doi.org/10.1080/10447318.2019.1619259
- Kaleidoscope: Semantically-Grounded, Context-Specific ML Model Evaluation. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 775, 13 pages. https://doi.org/10.1145/3544548.3581482
- Embodying Physics-Aware Avatars in Virtual Reality. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 254, 15 pages. https://doi.org/10.1145/3544548.3580979
- Josh Tobin. 2023. LLMOps: Deployment and Learning in Production. https://www.youtube.com/watch?v=Fquj2u7ay40
- LLaMA: Open and Efficient Foundation Language Models. https://doi.org/10.48550/arXiv.2302.13971
- Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288
- United Nations. 2015. Transforming our World: the 2030 Agenda for Sustainable Development. United Nations General Assembly. https://sdgs.un.org/2030agenda
- Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.
- The Birth of Bias: A case study on the evolution of gender bias in an English language model. CoRR abs/2207.10245 (2022). https://doi.org/10.48550/arXiv.2207.10245
- Marieke van Erp and Victor de Boer. 2021. A Polyvocal and Contextualised Semantic Web. In The Semantic Web: 18th International Conference, ESWC 2021, Virtual Event, June 6–10, 2021, Proceedings. Springer, Berlin, Heidelberg, 506–512. https://doi.org/10.1007/978-3-030-77385-4_30
- Altair: Interactive Statistical Visualizations for Python. Journal of Open Source Software 3, 32 (2018), 1057. https://doi.org/10.21105/joss.01057
- Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems. Association for Computational Linguistics, Online, 11–20. https://doi.org/10.18653/v1/2020.eval4nlp-1.2
- ChatGPT for Robotics: Design Principles and Model Abilities. Technical Report MSR-TR-2023-8. Microsoft. https://www.microsoft.com/en-us/research/publication/chatgpt-for-robotics-design-principles-and-model-abilities/
- Ellen Voorhees. 2000. The TREC-8 Question Answering Track Report. https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=151495
- Humanoid Agents: Platform for Simulating Human-like Generative Agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP ’23). arXiv:2310.05418
- Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations (ICLR ’22).
- Supporting Qualitative Analysis with Large Language Models: Combining Codebook with GPT-3 for Deductive Coding. In Companion Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI ’23 Companion). Association for Computing Machinery, New York, NY, USA, 75–78. https://doi.org/10.1145/3581754.3584136
- Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. arXiv:2304.13712
- Harnessing Biomedical Literature to Calibrate Clinicians’ Trust in AI Decision Support Systems. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 14, 14 pages. https://doi.org/10.1145/3544548.3581393
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arXiv:2309.17421
- How Language Model Hallucinations Can Snowball. https://doi.org/10.48550/arXiv.2305.13534
- BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations (ICLR ’20).
- ConceptEVA: Concept-Based Interactive Exploration and Customization of Document Summaries. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 204, 16 pages. https://doi.org/10.1145/3544548.3581260
- Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Martha Palmer, Rebecca Hwa, and Sebastian Riedel (Eds.). Association for Computational Linguistics, Copenhagen, Denmark, 2979–2989. https://doi.org/10.18653/v1/D17-1323
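The references to Grootendorst's c-TF-IDF and BERTopic, alongside Sentence-BERT, UMAP, and HDBSCAN, suggest the topic-modeling machinery used to group the extracted challenges into topics. As a minimal pure-Python sketch — an assumption about the pipeline, not the paper's actual implementation — the following computes c-TF-IDF weights per Grootendorst's published formula, W(t, c) = tf(t, c) · log(1 + A / tf(t)), to surface each cluster's characteristic terms. The toy class names and documents are hypothetical.

```python
import math
from collections import Counter

def ctfidf(class_docs: dict[str, list[str]]) -> dict[str, Counter]:
    """Class-based TF-IDF: all documents of a class are concatenated,
    tf(t, c) is the term's frequency in class c, tf(t) its frequency
    across all classes, and A the average number of words per class."""
    # Term frequencies per class (documents in a class joined together)
    tf_per_class = {
        c: Counter(" ".join(docs).lower().split())
        for c, docs in class_docs.items()
    }
    # Total frequency of each term across all classes
    tf_total = Counter()
    for counts in tf_per_class.values():
        tf_total.update(counts)
    # Average number of words per class
    avg_words = sum(tf_total.values()) / len(tf_per_class)
    return {
        c: Counter({t: tf * math.log(1 + avg_words / tf_total[t])
                    for t, tf in counts.items()})
        for c, counts in tf_per_class.items()
    }

# Toy example: two hypothetical challenge clusters
topics = ctfidf({
    "privacy": ["privacy controls for advertising", "privacy and consent"],
    "vr": ["avatars in virtual reality", "virtual reality embodiment"],
})
top_privacy = topics["privacy"].most_common(1)[0][0]  # highest-weighted term
```

In a BERTopic-style pipeline, this weighting runs after embedding (e.g. Sentence-BERT), dimensionality reduction (UMAP), and clustering (HDBSCAN), turning each cluster into a ranked list of descriptive terms.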