Crystallizing Schemas with Teleoscope: Thematic Curation of Large Text Corpora (2402.06124v2)
Abstract: Making sense of large text corpora is difficult when scales reach thousands or millions of documents. With the advent of LLMs, the potential for large-scale sense-making is being realized. However, this presents a need for rigour in the data curation stage of thematic analysis: selecting the right documents to achieve appropriate information power (saturation) requires an auditable trace of researchers' thought processes. In this paper, we present methodological and design findings from a three-year design process where we worked with qualitative researchers to create an open-source platform called Teleoscope designed to rigorously curate documents at scale. By implementing the qualitative research values common to thematic analysis during the curation stage (which we call thematic curation), we found researchers could come to a shared understanding of a large corpus and feel confident in their curation decisions (which we call schema crystallization).