Collaborative Theme Identification (CoTI)
- Collaborative Theme Identification Agent (CoTI) is a multi-agent LLM framework that automates qualitative thematic analysis of unstructured interview data such as clinical interviews.
- It employs specialized agents for extraction, evaluation, and refinement, reducing manual labor and aligning theme outputs with expert annotations.
- Its iterative human-in-the-loop workflow and optimized prompt engineering enable fast, scalable, and precise thematic coding across multiple domains.
The Collaborative Theme Identification Agent (CoTI) is a multi-agent LLM framework engineered for automated qualitative analysis, particularly the identification and synthesis of latent themes in unstructured interview data such as clinical or patient narratives. CoTI systematically decomposes the thematic analysis process into specialized agent roles coordinated via structured communication and, optionally, human feedback loops, with explicit mechanisms for prompt optimization, iterative evaluation, refinement, and codebook generation. CoTI demonstrates performance that matches or exceeds junior investigators and traditional LLM baselines on both coverage and alignment to expert-annotated themes, with substantial reductions in manual labor and increased scalability for clinical research, healthcare interviews, and other domains requiring thematic coding (Xu et al., 26 Mar 2025, Xu et al., 18 Dec 2025).
1. Multi-Agent Architecture and Functional Components
CoTI relies on a modular agent system with strict role separation, stage-based turn taking, and explicit human-in-the-loop design. Typical instantiations, as in both the TAMA clinical interview framework and heart failure patient analysis studies, utilize three core agents, with an optional fourth in some configurations:
- Generation/Extraction Agent ("Thematizer" or "CoTI Extractor"): Segments transcripts (≤1,500 words per chunk), performs inductive coding, and emits both discrete codes (name, description, direct quote) and preliminary themes per chunk.
- Evaluation Agent ("CoTI Critic"): Consumes draft themes; scores each theme using four criteria: Coverage, Actionability, Distinctiveness, and Relevance; generates granular, actionable feedback in the form of add, split, combine, or delete operations.
- Refinement Agent ("CoTI Refiner"): Applies feedback atomically, updating the theme set according to add, split, combine, delete actions to optimize thematic structure.
- Instructor Agent (in some versions): Utilizes a heavyweight reasoning model (e.g., QwQ-32B) to refine the extraction and reasoning prompts through multistage loops—Clue Extraction, Reasoning Generation, Evaluation, and Optimization.
Human experts or junior investigators may be integrated at multiple stages:
- Initial background/goals definition
- Criteria and example specification for evaluation
- Option to approve, terminate, or re-initiate refinement cycles
- Interactive feedback via user interface
All inter-agent communication adheres to a fixed schema with metadata header and payload, ensuring reproducibility and auditability (Xu et al., 26 Mar 2025, Xu et al., 18 Dec 2025).
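As a minimal sketch (not the authors' exact schema), such a header-plus-payload message might be represented as follows; all field names here are illustrative assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class AgentMessage:
    """Illustrative inter-agent message: metadata header plus payload.

    Field names are hypothetical; the source papers specify only that a fixed
    header/payload schema is used for reproducibility and auditability.
    """
    sender: str        # e.g., "Thematizer", "Critic", "Refiner"
    recipient: str
    stage: str         # e.g., "extraction", "evaluation", "refinement"
    iteration: int     # refinement cycle index
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    payload: dict[str, Any] = field(default_factory=dict)  # themes, codes, feedback, ...

# Example: the Critic sends add/split/combine/delete feedback to the Refiner.
feedback_msg = AgentMessage(
    sender="Critic",
    recipient="Refiner",
    stage="evaluation",
    iteration=2,
    payload={"actions": [{"op": "combine", "themes": ["T3", "T7"],
                          "rationale": "Overlapping coverage of medication adherence"}]},
)
```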
2. Algorithmic Workflow and Prompt Engineering
The CoTI pipeline employs a multi-phase, iterative workflow:
- Preprocessing: Transcripts are chunked for length, optionally quality-checked, and fed to the system.
- Prompt Optimization (Instructor): Prompts for clue extraction and reasoning are refined iteratively using LLM-based feedback loops. Optimization objectives include enforcing direct quote dependence, causal completeness, and exclusivity of topic assignment.
- Thematization (Thematizer/Extractor):
- Refined prompts guide extraction of direct-quote clues per transcript.
- Reasoning chains (stepwise, mechanism-focused, grounded in clues) are generated.
- Multiple runs (typically N=3) per transcript expand clue/theme set coverage.
- Evaluation & Refinement:
- The Evaluation Agent scores each theme against the four expert-defined criteria. Sample evaluation prompt: "Evaluate each theme for Coverage, Actionability, Distinctiveness, and Relevance..."
- Atomic refinement actions are applied: add, split, combine, delete.
- Codebook Generation (CodebookGenerator):
- Each per-interview theme is embedded (via text-embedding-3-small).
- Themes with cosine similarity above a threshold (e.g., 0.8) are clustered as a codebook entry.
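To make the codebook step concrete, the sketch below embeds per-interview themes with the named model and greedily merges each theme into the first existing entry whose centroid it matches above the threshold. The OpenAI client usage follows the standard openai Python package; the single-pass greedy clustering is an assumption, not the authors' exact procedure.

```python
import numpy as np
from openai import OpenAI  # official openai Python package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed theme strings with text-embedding-3-small (the model named in the paper)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def build_codebook(themes: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedy single-pass clustering: a theme joins the first entry whose centroid
    it matches at or above `threshold` cosine similarity, else it starts a new entry.
    (The exact clustering procedure is an assumption; the source specifies only the
    embedding model and the similarity threshold.)"""
    vecs = embed(themes)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    entries: list[list[int]] = []       # theme indices per codebook entry
    centroids: list[np.ndarray] = []
    for i, v in enumerate(vecs):
        sims = [float(v @ c) for c in centroids]
        if sims and max(sims) >= threshold:
            j = int(np.argmax(sims))
            entries[j].append(i)
            centroid = vecs[entries[j]].mean(axis=0)
            centroids[j] = centroid / np.linalg.norm(centroid)
        else:
            entries.append([i])
            centroids.append(v)
    return [[themes[i] for i in entry] for entry in entries]
```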
The workflow is summarized by the following high-level pseudocode, which outlines both core logical steps and the human-in-the-loop decision process:
\begin{algorithmic}[1]
\Require $D$: Transcripts, $E$: Expert, $\theta$: Similarity threshold
\State \text{/* Step 1: Prompt design */}
\State $P_{gen} \leftarrow E.\text{createPrompt}(D, \text{goals})$
\State $C \leftarrow \text{chunk}(D)$
\ForAll{chunk $c \in C$}
    \State $\{codes_c, themes_c\} \leftarrow \text{GenAgent}.\text{analyze}(c, P_{gen})$
\EndFor
\State $T^{(0)} \leftarrow \bigcup_c themes_c$
\Repeat
    \State \text{/* Step 3: Criteria definition */}
    \State $P_{eval} \leftarrow E.\text{createEvalPrompt}(\text{criteria}, \text{examples})$
    \State $\text{feedback} \leftarrow \text{EvalAgent}.\text{score}(T^{(k)}, P_{eval})$
    \State $T^{(k+1)} \leftarrow \text{RefineAgent}.\text{apply}(T^{(k)}, \text{feedback})$
    \State $\textit{stop} \leftarrow E.\text{approve}(T^{(k+1)})$
\Until{$\textit{stop} = \text{true}$}
\State \Return Final themes $T^{(k+1)}$
\end{algorithmic}
(Xu et al., 26 Mar 2025, Xu et al., 18 Dec 2025)
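The same control flow can be rendered as a compact Python driver; the agent objects, method names, and chunking helper below are placeholders mirroring the pseudocode, not a published API.

```python
def chunk_transcripts(transcripts, max_words=1500):
    """Split each transcript into chunks of at most `max_words` words."""
    for doc in transcripts:
        words = doc.split()
        for i in range(0, len(words), max_words):
            yield " ".join(words[i:i + max_words])

def run_coti(transcripts, expert, gen_agent, eval_agent, refine_agent, max_iters=10):
    """Hypothetical driver mirroring the pseudocode above (all names illustrative)."""
    # Step 1: expert designs the generation prompt
    p_gen = expert.create_prompt(transcripts, goals=expert.goals)

    # Step 2: chunk transcripts and extract per-chunk codes and draft themes
    themes = []
    for chunk in chunk_transcripts(transcripts, max_words=1500):
        _codes, chunk_themes = gen_agent.analyze(chunk, p_gen)
        themes.extend(chunk_themes)

    # Step 3: expert defines evaluation criteria; iterate evaluate -> refine
    p_eval = expert.create_eval_prompt(criteria=expert.criteria, examples=expert.examples)
    for _ in range(max_iters):
        feedback = eval_agent.score(themes, p_eval)    # add/split/combine/delete ops
        themes = refine_agent.apply(themes, feedback)  # atomic refinement actions
        if expert.approve(themes):                     # human-in-the-loop stop check
            break
    return themes
```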
3. Evaluation Metrics, Quantitative Results, and Comparative Analysis
Agent and human (manual TA) outputs are compared using an array of quantitative metrics:
- Jaccard Similarity: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Used for both clue sets and theme/code overlap; lower internal Jaccard within CoTI themes indicates higher distinctiveness.
- Cosine Similarity: $\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}$
Measures alignment between theme/codebook embeddings (text-embedding-3-small).
- Hit Rate: $\mathrm{HitRate} = \frac{|\{t \in T_H : \exists\, t' \in T_L \text{ matching } t\}|}{|T_H|}$
The fraction of themes in the human theme set $T_H$ that are matched by at least one theme in the LLM theme set $T_L$.
- Precision, Recall, F1 for clue extraction.
- Cohen’s κ (proposed for inter-coder reliability): $\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ is the observed agreement and $p_e$ is the expected chance agreement.
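A minimal sketch of how these set-overlap metrics might be computed, assuming a theme "matches" when embedding cosine similarity exceeds a cutoff (the exact matching rule and the 0.8 cutoff are assumptions, not values from the papers):

```python
import numpy as np

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two sets of codes or clue strings."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def hit_rate(human_vecs: np.ndarray, llm_vecs: np.ndarray, cutoff: float = 0.8) -> float:
    """Fraction of human themes matched by at least one LLM theme.

    `human_vecs` and `llm_vecs` are embedding matrices with one row per theme.
    """
    hits = 0
    for h in human_vecs:
        if any(cosine(h, v) >= cutoff for v in llm_vecs):
            hits += 1
    return hits / len(human_vecs)
```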
Empirical results demonstrate:
| Method | Jaccard (Clues) | Precision | Recall | F1 | Theme Cosine | Codebook Cosine | Processing Time |
|---|---|---|---|---|---|---|---|
| Basic QwQ-32B | 0.374 | 0.496 | 0.635 | 0.540 | 0.411 | 0.508 | ~3 min/interview |
| CoTI | 0.403 | 0.550 | 0.630 | 0.569 | 0.431 | 0.621 | ~1 min/interview |
| TAMA baseline | 0.42* | – | – | – | – | – | ~30 hr/manual |
| TAMA multi-agent | 0.29* | – | – | – | – | – | <10 min (full) |
* Jaccard Similarity: LLM theme vs. human theme (Xu et al., 26 Mar 2025, Xu et al., 18 Dec 2025).
These results indicate that the multi-agent evaluation/refinement cycles in CoTI and TAMA yield higher hit rates (up to 0.92), closer alignment to senior expert-authored themes, and lower inter-theme overlap than junior human coders and single-agent or unsupervised baselines (LDA, Top2Vec, BERTopic). Processing times are reduced by up to 99% relative to manual thematic analysis.
4. Human-AI Collaboration: Interaction, Applications, and Observed Behaviors
CoTI was implemented as a web-based application for semi-structured clinical interview analysis:
- Front end built in JavaScript/React; users supply their own Azure OpenAI key.
- Users upload interview transcripts and receive clues, themes, and reasoning chains in minutes.
- Interactive feedback loop: Investigators may request reruns or provide free-form feedback before finalizing results.
Empirical studies found that:
- CoTI output clustered significantly closer to senior investigator codebooks and themes than junior investigator output.
- Collaboration with junior investigators yielded only marginal improvements in recall and sometimes reduced thematic distinctiveness, possibly due to overreliance (automation bias).
- Manual thematic analysis by experts remained the high bar for thematic nuance, particularly where inter-coder reliability is critical (Xu et al., 18 Dec 2025).
CoTI also generalizes readily across domains (e.g., COVID-19 qualitative interviews in Sierra Leone), with modular prompts and refinement strategies customizable depending on discipline and transcript complexity.
5. Best Practices, Limitations, and Future Directions
Best Practices:
- Incorporate domain experts for prompt engineering, evaluation criterion definition, and final approval; adapt agent prompts for context-specific phenomena.
- Tune transcript chunk sizes to domain needs; extend refinement action space for specialized domains (e.g., relabel, reorder).
- For inter-coder reliability, deploy parallel agent cycles and compare results quantitatively (e.g., Jaccard, Cohen’s κ; see the sketch following this list) (Xu et al., 26 Mar 2025).
- Adjust clustering thresholds for codebook generation based on dataset semantics.
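For the parallel-agent reliability check above, one illustrative approach (an assumption, not the authors' protocol) is to reduce each cycle's output to per-segment binary code assignments and compute Cohen’s κ with scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

def agreement(assignments_a: dict, assignments_b: dict, segments: list, code: str) -> float:
    """Cohen's kappa for one code across transcript segments.

    `assignments_a[segment]` / `assignments_b[segment]` hold the sets of codes each
    agent cycle applied to that segment; names and structure are illustrative.
    """
    y_a = [int(code in assignments_a.get(seg, set())) for seg in segments]
    y_b = [int(code in assignments_b.get(seg, set())) for seg in segments]
    return cohen_kappa_score(y_a, y_b)

# Example: two parallel CoTI runs coded the same five segments.
segments = ["s1", "s2", "s3", "s4", "s5"]
run_a = {"s1": {"adherence"}, "s2": {"adherence", "cost"}, "s4": {"cost"}}
run_b = {"s1": {"adherence"}, "s2": {"cost"}, "s3": {"adherence"}, "s4": {"cost"}}
print(agreement(run_a, run_b, segments, "adherence"))
```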
Limitations:
- Most experiments used single-expert/clinician feedback, limiting generalizability and robustness claims.
- Over-refinement risks hallucination or proliferation of low-frequency themes if not restrained by human oversight.
- Automation bias was observed among less-experienced investigators, curtailing critical revision.
- No formal statistical significance testing or cross-domain inter-rater reliability assessment has yet been implemented.
Future Enhancements:
- Integrate senior investigator iterative feedback (expert-in-the-loop) directly into prompt optimization cycles.
- Extend to additional languages, domains, and larger sample sizes.
- Incorporate active learning via low-confidence theme flagging.
- Algorithmic innovation in clustering (e.g., dynamic thresholding, hierarchical or spectral methods) for codebook construction; a hierarchical variant is sketched after this list.
- UI improvements for more granular transcript–theme annotation and parallel adjudication.
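As an illustration of the hierarchical direction mentioned above, an average-linkage clustering of theme embeddings with a cosine-distance cut could replace the fixed similarity threshold; the method, metric, and cutoff below are assumptions, not results from the papers.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_codebook(theme_vecs: np.ndarray, distance_cut: float = 0.2) -> np.ndarray:
    """Average-linkage hierarchical clustering of theme embeddings.

    `distance_cut` is a cosine distance (1 - similarity), so 0.2 roughly corresponds
    to the 0.8 similarity threshold used for flat clustering.
    """
    Z = linkage(theme_vecs, method="average", metric="cosine")
    return fcluster(Z, t=distance_cut, criterion="distance")  # cluster label per theme

# A dynamic variant might instead select the cut that maximizes a cluster-quality
# score (e.g., silhouette) over a range of candidate thresholds.
```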
A plausible implication is that CoTI, by structurally decomposing qualitative analysis into modular, auditable agent roles, offers a reproducible and extensible scaffold for clinical and social science research, enabling greater scalability and consistency in thematic discovery while preserving the option for iterative, high-fidelity human feedback (Xu et al., 26 Mar 2025, Xu et al., 18 Dec 2025).