
An AI-Powered Research Assistant in the Lab: A Practical Guide for Text Analysis Through Iterative Collaboration with LLMs (2505.09724v2)

Published 14 May 2025 in cs.CL, cs.AI, and cs.HC

Abstract: Analyzing texts such as open-ended responses, headlines, or social media posts is a time- and labor-intensive process highly susceptible to bias. LLMs are promising tools for text analysis, using either a predefined (top-down) or a data-driven (bottom-up) taxonomy, without sacrificing quality. Here we present a step-by-step tutorial to efficiently develop, test, and apply taxonomies for analyzing unstructured data through an iterative and collaborative process between researchers and LLMs. Using personal goals provided by participants as an example, we demonstrate how to write prompts to review datasets and generate a taxonomy of life domains, evaluate and refine the taxonomy through prompt and direct modifications, test the taxonomy and assess intercoder agreements, and apply the taxonomy to categorize an entire dataset with high intercoder reliability. We discuss the possibilities and limitations of using LLMs for text analysis.

Summary

This paper, "An AI-Powered Research Assistant in the Lab: A Practical Guide for Text Analysis Through Iterative Collaboration with LLMs" (Carmona-Díaz et al., 14 May 2025), presents a practical, step-by-step methodology for leveraging LLMs in social science research to analyze unstructured text data. It addresses the challenges of traditional text analysis methods, such as manual coding (time-consuming, labor-intensive, prone to human bias) and simple automated methods (limited contextual understanding, difficulty with nuance). The proposed approach focuses on collaboratively developing, testing, and applying taxonomies for text classification through an iterative process involving both human researchers and LLMs.

The core of the paper is an eight-step tutorial designed to guide researchers through the process, illustrated with an example dataset of personal goals collected from participants in a study on socioeconomic status and goal pursuit. The motivation for using an LLM in this example was the large dataset (3,185 goals) and the need for a bottom-up taxonomy reflecting the participants' actual language, rather than a potentially ill-fitting top-down academic taxonomy.

Here is a breakdown of the eight steps:

  1. Write the Initial Prompt: This foundational step involves crafting a detailed prompt for the LLM (or human coder) that includes four key elements:
    • Context: The research question, relevant definitions, and characteristics of the data. This ensures the generated taxonomy is relevant to the specific research goals.
    • Role: Assigning a role (e.g., AI assistant helping a social researcher) provides additional context for the LLM.
    • Task Description: A specific and clear description of the task, including requirements for the taxonomy (e.g., hierarchical vs. flat, allowing overlap, desired number of categories).
    • Expected Output: Defining the desired format (e.g., bullet points) and content (e.g., category label, definition, examples from data) for the generated taxonomy.
  2. Generate the Taxonomy: The initial prompt and the relevant dataset are provided to the LLM. The paper notes that for large datasets, an LLM API (such as OpenAI's via Python) is necessary rather than a chat interface; a minimal API sketch appears after this list. The input data should be simplified, containing only necessary, anonymized information (e.g., participant ID and the text to be classified). For complex data, structuring it into a single narrative string can be effective.
  3. Evaluate the Taxonomy: Once the LLM generates an initial taxonomy, human and/or LLM evaluators assess its quality using a rubric. The evaluation focuses on four criteria:
    • Relevance: Does it help answer the research question?
    • Mutual Exclusivity: Are categories distinct with minimal overlap?
    • Hierarchical Coherence: Are categories at the same level of abstraction (for flat taxonomies) or logically organized across levels (for hierarchical)?
    • Parsimony: Is the number of categories appropriate, avoiding unnecessary complexity?
    For each criterion, evaluators provide a binary rating (meets/does not meet) with a justification, plus qualitative feedback on weaknesses and recommendations for improvement.
  4. Revise and Adjust the Initial Prompt (if Necessary): If the evaluation reveals significant weaknesses (e.g., low relevance, poor structure), the original prompt is revised. This might involve clarifying the research context (Step 1.1), adding specific requirements to the task description (Step 1.3), or even modifying the role description (Step 1.2) or input data format. The process then returns to Step 2.
  5. Adjust the Taxonomy: If the evaluation indicates only minor issues, researchers can directly modify the generated taxonomy. This includes refining category labels, descriptions, or examples, or merging similar categories. This step results in a refined version of the taxonomy ready for testing.
  6. Test the Taxonomy: The refined taxonomy is tested for clarity and comprehensiveness by having multiple coders (human and/or LLM) classify a subset of the data.
    • A new prompt is created specifically for the classification task, providing the taxonomy and classification rules but omitting sensitive research details like hypotheses to avoid bias.
    • The number of items in the test subset should be sufficient for calculating intercoder reliability metrics (e.g., considering required sample size for Kappa, ICC, or Alpha) and ensuring representation across categories (e.g., >10 items per category).
    • An "Orphans" category is included to identify items that don't fit any existing category, assessing comprehensiveness.
    • For LLM classification, the paper recommends running the classification multiple times (e.g., 5 times with GPT) and using a majority vote (e.g., keeping a classification agreed upon at least 3 times) for robustness; see the majority-vote sketch after this list.
    • Intercoder reliability is assessed using appropriate statistical indices (Cohen's Kappa, Fleiss' Kappa, ICC, Krippendorff's Alpha), as in the reliability sketch after this list. Reliability metrics per category are also informative.
  7. Make Final Adjustments (if Necessary): If intercoder reliability is below an acceptable threshold, coders review disagreements to identify sources of confusion. Adjustments focus on improving clarity:
    • Revising category descriptions (labels, definitions, examples).
    • Adding explicit classification rules to address ambiguous cases or define category boundaries.
    • If many items fall into the "Orphans" category and share a common theme relevant to the research, a new category might be added. The taxonomy is refined, and optionally, testing (Step 6) is repeated with a new subset until satisfactory reliability is achieved.
  8. Apply the Taxonomy: Once the taxonomy is finalized and validated through testing, it is used to classify the entire dataset. The output is typically a structured format (e.g., a dataframe; see the sketch after this list) where each data point is scored according to its fit with each category. This classified data can then be used for statistical analysis (e.g., calculating category frequencies, exploring relationships with other variables). The paper concludes this step by emphasizing the importance of discussing the theoretical implications of the resulting classification and comparing the bottom-up, data-driven taxonomy to existing frameworks.
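
To make Step 2 concrete, here is a minimal sketch of taxonomy generation with the OpenAI Python SDK, which the paper mentions as one route for large datasets. The model name, prompt wording, and goals data are illustrative assumptions, not the authors' exact materials:

```python
# Minimal sketch of Step 2 (taxonomy generation), assuming the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Anonymized input: participant ID plus the text to classify,
# folded into a single narrative string as the paper suggests.
goals = [(101, "Save enough money to move out"),
         (102, "Spend more time with my kids")]
data_block = "\n".join(f"Participant {pid}: {text}" for pid, text in goals)

# The prompt carries the four elements of Step 1.
prompt = (
    "You are an AI assistant helping a social researcher.\n"           # role
    "We study which life domains people's personal goals refer to.\n"  # context
    "Task: read the goals below and propose a flat taxonomy of life "
    "domains with at most 10 mutually exclusive categories.\n"         # task description
    "For each category, give a label, a one-sentence definition, and "
    "two example goals from the data, as a bulleted list.\n\n"         # expected output
    f"Goals:\n{data_block}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```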
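
The majority-vote procedure recommended in Step 6 can be sketched as follows; the five-run, three-agreement threshold follows the paper's example, while `classify_goal` is a hypothetical wrapper around the same chat API, not the authors' code:

```python
# Sketch of the Step 6 majority vote: classify each item 5 times and keep a
# label only if at least 3 runs agree; otherwise flag it for human review.
from collections import Counter

from openai import OpenAI

client = OpenAI()

def classify_goal(goal_text: str, taxonomy_prompt: str) -> str:
    """Hypothetical helper: ask the LLM for exactly one category label."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"{taxonomy_prompt}\n\nGoal: {goal_text}\n"
                              "Answer with exactly one category label, "
                              "or 'Orphans' if none fits."}],
    )
    return response.choices[0].message.content.strip()

def majority_label(goal_text: str, taxonomy_prompt: str,
                   runs: int = 5, threshold: int = 3) -> str:
    labels = [classify_goal(goal_text, taxonomy_prompt) for _ in range(runs)]
    label, count = Counter(labels).most_common(1)[0]
    # Below-threshold agreement is treated as 'Orphans' so that ambiguous
    # items stay visible for the Step 7 disagreement review.
    return label if count >= threshold else "Orphans"
```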
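
Once two coders (human or LLM) have labeled the same test subset, intercoder agreement can be checked with standard tooling. A minimal sketch using scikit-learn's Cohen's Kappa, with made-up label vectors:

```python
# Sketch of the Step 6 reliability check; the label vectors are invented.
from sklearn.metrics import cohen_kappa_score

human_labels = ["Finance", "Family", "Health", "Family", "Orphans"]
llm_labels   = ["Finance", "Family", "Health", "Career", "Orphans"]

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's Kappa: {kappa:.2f}")
```

Per-category reliability, which the paper also recommends inspecting, can be computed the same way by binarizing each category (member/non-member) and calculating Kappa per column.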
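
Finally, the structured output described in Step 8 maps naturally onto a dataframe with one indicator column per category, which makes the downstream frequency analysis a one-liner. A sketch with pandas, using invented categories and data:

```python
# Sketch of the Step 8 output: one row per goal, one 0/1 column per category.
import pandas as pd

df = pd.DataFrame({
    "participant": [101, 102, 103],
    "goal": ["Save enough money to move out",
             "Spend more time with my kids",
             "Run a half marathon"],
    "Finance": [1, 0, 0],
    "Family":  [0, 1, 0],
    "Health":  [0, 0, 1],
})

# Category frequencies across the dataset; if overlap is allowed,
# the indicator columns need not sum to 1 per row.
print(df[["Finance", "Family", "Health"]].mean())
```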

The paper highlights several advantages of this LLM-assisted approach, including the ability to process large volumes of text data efficiently, maintain high contextual understanding and consistency, and offer flexibility through prompt adjustments, while requiring significantly less human effort compared to manual methods. Human researchers transition from primary coders to managers and evaluators of the AI assistants.

However, the authors also discuss important limitations and ethical considerations. These include the potential for LLM hallucinations (fabricating plausible but incorrect information), biases inherited or amplified from training data (related to social categories), the need for data anonymization and participant consent, and ensuring meaningful human control over the process rather than full automation. The iterative nature of the proposed method, particularly the evaluation and adjustment steps involving human oversight, is presented as a way to mitigate some of these risks.

In summary, the paper provides a practical, structured guide for researchers to effectively integrate LLMs into qualitative text analysis workflows, specifically for taxonomy development and application, while acknowledging the current limitations and ethical considerations associated with using these powerful tools.
