Human-AI Collaborative Taxonomy Construction
- Human-AI collaborative taxonomy construction is a process that integrates domain expert feedback with AI-generated proposals to design structured and reliable classification schemes.
- It employs iterative, human-in-the-loop workflows where dialog-guided refinement and validation metrics like Cohen’s κ ensure taxonomy stability and practical utility.
- Applications across legal, citizen science, software engineering, and technical service automation underscore its significance in creating context-aware, adaptive taxonomies.
Human-AI collaborative taxonomy construction is the process of designing, refining, and validating structured classification schemes (taxonomies) by systematically integrating human expertise and AI/ML capabilities. This paradigm underpins a wide range of activities, including profession-specific writing assistants, human-in-the-loop data curation, artifact tracking in automated workflows, reinforcement learning, and developer-AI tooling. State-of-the-art approaches leverage iterative, dialog-guided workflows where LLMs or other ML systems generate taxonomic candidates, and domain experts provide critical feedback, ultimately producing application-tailored taxonomies with measurable reliability and transparency (Lee et al., 26 Jun 2024, Meier et al., 2023, and et al., 2023). The following sections synthesize recent research on methods, frameworks, and evaluation of human-AI collaborative taxonomy construction across domains.
1. Methodological Foundations: Iterative Human-AI Taxonomy Construction
Recent profession-specific taxonomy construction methods instantiate a three-stage pipeline integrating LLMs and domain experts:
- Taxonomy Generation: An LLM receives a domain description and (optionally) prototypical data or examples. Via hierarchical prompting, it generates an initial set of candidate high-level labels, definitions, and examples—each augmented with the LLM's internal reasoning (Lee et al., 26 Jun 2024).
- Taxonomy Validation (LLMs as Mediators): Two roles are defined—an Interviewer LLM conducts structured multi-turn dialogues with human experts to elicit feedback (clarity, overlap, omissions); a Creator LLM applies this feedback, revising the taxonomy iteratively until convergence. This process is repeated across multiple experts, producing independently curated taxonomies.
- Merging and Reliability Testing: An Aggregator LLM merges multiple expert-validated taxonomies, ensuring mutual exclusivity and collective exhaustiveness. Human experts and LLMs then independently annotate new domain instances using the final taxonomy, and inter-coder reliability is quantified via metrics such as Cohen’s κ. High κ indicates taxonomy stability and utility in downstream applications (e.g., in a legal email revision assistant, κ_human ≈ 0.78 and κ_cross ≈ 0.75 indicate substantial inter-coder agreement) (Lee et al., 26 Jun 2024).
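The reliability check in the final stage can be illustrated with a small Cohen’s κ computation. A minimal sketch in plain Python; the annotation labels are invented for illustration, not data from the cited study:

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators coding the same items."""
    assert len(ann_a) == len(ann_b)
    n = len(ann_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Expected chance agreement under independent labeling.
    counts_a, counts_b = Counter(ann_a), Counter(ann_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical revision-intention codes assigned to six emails
# by a human expert and an LLM annotator.
human = ["clarity", "tone", "clarity", "structure", "tone", "clarity"]
llm   = ["clarity", "tone", "structure", "structure", "tone", "clarity"]
print(round(cohens_kappa(human, llm), 2))  # → 0.75
```

Values in the 0.61–0.80 band are conventionally read as substantial agreement, which is the regime the cited κ scores fall into.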
This architecture is operationalized as a set of coordinated software services: web-based interfaces for expert involvement, back-end orchestration of LLM APIs, prompt layering, dialogue state management, and persistent storage of taxonomy versions, annotations, and validation logs (Lee et al., 26 Jun 2024).
2. Multidimensional Design of Human-AI Collaborative Taxonomies
Taxonomies are systematically characterized along multiple orthogonal dimensions. Prominent frameworks include:
- Agency, Interaction, and Adaptation Model (Holter et al., 18 Apr 2024):
- Agency: Distribution (human, AI, mixed), Allocation (pre-determined, negotiated).
- Interaction: Intent (receive guidance, explore, provide/request feedback), Degree (orienting, directing, prescribing), Focus (system, data, task, etc.), Feedback Type (explicit, implicit, both).
- Adaptation: Which agents adapt, how (task or communication improvement), and what information is learned (domain, task, agent goals/preferences).
- Artifact-Centric Taxonomies (and et al., 2023):
- Source: Human, Data, AutoML processes, System, Organization.
- Transmission Mode: Boundary-crossing (human↔machine), non-boundary (within human or within machine).
- Artifact Format: Numeric, textual, tabular, tensor, graph, specification, report.
- Task Purpose: Informing, exploring, governing, sharing, steering.
- Task and Learning Perspective (Dellermann et al., 2021):
- Task Characteristics: Recognition, prediction, reasoning, action; goal alignment; data representation; intervention timing.
- Learning Paradigm: Human- or machine-centric, supervised/unsupervised/reinforcement/semi-supervised.
- Teaching Methods: Demonstration, labeling, troubleshooting, verification; explicit vs. implicit interaction; single vs. collective input.
- Feedback and Interpretability: Query strategy, feedback types, interpretability tier.
The combination of such dimensions—often represented as Cartesian product design spaces—enables rigorous mapping, comparison, and extension of taxonomy construction methods across application domains (Holter et al., 18 Apr 2024, Dellermann et al., 2021, and et al., 2023).
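Such a Cartesian-product design space can be enumerated mechanically. A sketch using illustrative (deliberately incomplete) subsets of the dimensions above; the full frameworks define more axes and more values per axis:

```python
from itertools import product

# Illustrative subsets of the dimensions discussed above.
design_space = {
    "agency_distribution": ["human", "AI", "mixed"],
    "agency_allocation":   ["pre-determined", "negotiated"],
    "feedback_type":       ["explicit", "implicit", "both"],
    "interaction_degree":  ["orienting", "directing", "prescribing"],
}

dims = list(design_space)
configurations = [dict(zip(dims, combo))
                  for combo in product(*design_space.values())]

print(len(configurations))   # 3 * 2 * 3 * 3 = 54 candidate designs
print(configurations[0])     # one point in the design space
```

Enumerating the space this way makes it easy to check which combinations an existing system occupies and which remain unexplored.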
3. Workflow Patterns, System Architectures, and Interaction Modes
Human-AI taxonomy construction workflows vary depending on the level of automation and expert involvement:
- Human-in-the-Loop: Explicit curation, correction, or labeling of ML/AI-generated clusters, candidates, or artifacts, with visualization interfaces supporting iterative refinement and interpretive authority retained by the human (Meier et al., 2023, and et al., 2023).
- Human-on-the-Loop/Out-of-the-Loop: Automated or semi-automated construction, where human experts primarily monitor or audit AI outputs, intervening only by exception, e.g., via confidence thresholds or proactive supervision (Wulf et al., 18 Jul 2025).
- Co-Creative and Integrative Models: Mixed-initiative workflows, where humans and AI interleave proposal, critique, and adaptation, facilitated by real-time feedback, visualization of embedding or cluster variability, and dialogic reasoning (often via LLM Interviewer/Critic roles) (Lee et al., 26 Jun 2024, Meier et al., 2023).
Representative system architectures include modular Streamlit/React front-ends, Python-based orchestration of prompt chains and conversation history, and persistent annotation/feedback logs (Lee et al., 26 Jun 2024). Visual analytics tools such as AutoMLTrace leverage the artifact taxonomy to track temporality and provenance within human-AI workflows (and et al., 2023).
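Persistent storage of taxonomy versions and validation feedback can be modeled as an append-only record. A minimal sketch; the class and field names are assumptions for illustration, not taken from the cited systems:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TaxonomyVersion:
    version: int
    categories: tuple     # (name, definition) pairs
    rationale: str        # model- or expert-provided reasoning for the change
    author: str           # e.g. "creator-llm", "expert-1"

@dataclass
class TaxonomyLog:
    versions: list = field(default_factory=list)

    def commit(self, categories, rationale, author):
        v = TaxonomyVersion(len(self.versions) + 1,
                            tuple(categories), rationale, author)
        self.versions.append(v)  # append-only: prior versions are never mutated
        return v

    def latest(self):
        return self.versions[-1]

log = TaxonomyLog()
log.commit([("clarity", "wording is ambiguous")], "initial draft", "creator-llm")
log.commit([("clarity", "wording is ambiguous"),
            ("tone", "register mismatch")], "expert added 'tone'", "expert-1")
print(log.latest().version)  # → 2
```

Keeping every version with its rationale is what makes later auditing and the transparent-reasoning guidance in Section 5 possible.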
4. Case Studies and Application Domains
Human-AI collaborative taxonomy construction is instantiated in a spectrum of domains:
- Profession-Specific Writing Assistants: Legal email revision intentions taxonomy, constructed by LLM-assisted expert authoring and validated by annotation agreement (Lee et al., 26 Jun 2024).
- Heterogeneous Text Organization: Data-driven cluster construction for citizen science questions and open government metadata using embedding-based KNN search and iterative visual curation; small-multiples visualization to expose model variability (Meier et al., 2023).
- Software Engineering Tools: Taxonomies of developer–AI interaction modes enable annotation and analysis of coding assistants, yielding classification schemes covering auto-completion, refactoring, contextual recommendations, and conversational assistance (Treude et al., 15 Jan 2025).
- Technical Service Automation: Six-mode spectrum (HAM, HIC, HITP, HITL, HOTL, HOOTL) links modes of collaboration to risk, task complexity, and system trust, prescribing workflow architectures from full human oversight to total autonomy (Wulf et al., 18 Jul 2025).
These cases demonstrate that mutual exclusivity and exhaustiveness are routinely enforced by explicit merging and coverage verification, while reliability is quantified via inter-coder agreement on new data.
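The embedding-based KNN retrieval behind the heterogeneous-text case (Meier et al., 2023) can be sketched in plain Python. The 3-d vectors below are toy stand-ins for the high-dimensional sentence embeddings a real system would use:

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

def knn(query, embeddings, k=2):
    """Return the ids of the k nearest items by cosine similarity."""
    ranked = sorted(embeddings,
                    key=lambda i: cosine(query, embeddings[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings for four citizen-science questions.
embeddings = {
    "q1": (0.9, 0.1, 0.0),
    "q2": (0.8, 0.2, 0.1),
    "q3": (0.0, 0.9, 0.4),
    "q4": (0.1, 0.8, 0.5),
}
print(knn((1.0, 0.0, 0.0), embeddings))  # → ['q1', 'q2']
```

In the interactive workflow, a curator would inspect such neighborhoods visually, then name, merge, or split the resulting clusters.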
5. Evaluation, Reliability Metrics, and Design Guidelines
Taxonomy construction validity and utility are established through:
- Annotation Reliability: Multiple human and AI annotators independently code domain artifacts; agreement is quantified using κ or other inter-rater metrics, guiding further refinement (Lee et al., 26 Jun 2024, Meier et al., 2023).
- Systematic Stopping Criteria: Iterative cycles terminate when no further substantive suggestions or objections occur, and when the taxonomy achieves stability (no new categories in successive iterations) (Dellermann et al., 2021, and et al., 2023).
- Design Guidance:
- Expose Model Variability: Provide visual access to multiple possible ML/LLM outputs to surface disagreement, fostering interpretive, rather than authoritative, ML assistance (Meier et al., 2023).
- Iterative Human-in-the-Loop Correction: Ensure that every final class definition is named, merged, split, or redefined by a human, preventing unexamined delegation to AI clustering.
- Task and Goal Alignment: Align taxonomy structure with application-specific objectives and intervention points (e.g., feature engineering, action recommendation) (Dellermann et al., 2021).
- Transparent Reasoning Display: Retain model-generated rationales for each class or revision, grounding human trust and fostering justifiable, comprehensible taxonomies (Lee et al., 26 Jun 2024).
Designers are advised to select and tailor taxonomy construction workflows and system architectures in accordance with application risk, complexity, and available human expertise (Wulf et al., 18 Jul 2025, Dellermann et al., 2021).
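The stability criterion above — stop when successive iterations introduce no new categories — can be sketched as a simple fixed-point loop over a stream of (here simulated) expert revision rounds:

```python
def refine_until_stable(initial, revision_rounds):
    """Apply revision rounds until the category set stops changing.

    `revision_rounds` is an iterable of proposed category sets,
    standing in for successive expert feedback cycles."""
    current = set(initial)
    for iteration, proposal in enumerate(revision_rounds, start=1):
        updated = current | set(proposal)
        if updated == current:      # no new categories: taxonomy is stable
            return current, iteration
        current = updated
    return current, None  # revisions exhausted before reaching stability

rounds = [
    {"clarity", "tone"},        # adds "tone"
    {"structure"},              # adds "structure"
    {"tone", "structure"},      # nothing new → stable, stop
]
taxonomy, stopped_at = refine_until_stable({"clarity"}, rounds)
print(sorted(taxonomy), stopped_at)  # → ['clarity', 'structure', 'tone'] 3
```

A production variant would also track renames, merges, and splits rather than pure set growth, but the termination logic is the same.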
6. Challenges, Limitations, and Future Directions
Persistent challenges in human-AI collaborative taxonomy construction include:
- Ambiguity in Model Outputs: High-dimensional embedding spaces may yield divergent clusterings depending on model or metric selection, necessitating visualization and human comparison (Meier et al., 2023).
- Scalability and Generalization: Existing human-in-the-loop interfaces risk overload with large datasets or complex taxonomies; balancing expert oversight and automation remains nontrivial (Lee et al., 26 Jun 2024, and et al., 2023).
- Evaluation Standards and Benchmarking: Standardized inter-coder datasets, transferability benchmarks, and longitudinal evaluation of taxonomy stability are underdeveloped (Li, 16 May 2024).
- Trust, Interpretability, and Control: Over-reliance on black-box AI for taxonomy induction can erode user control and propagate unnoticed biases; solutions include multi-model output display, explicit rationale documentation, and enforced human sign-off at all stages (Meier et al., 2023, Lee et al., 26 Jun 2024).
Future research directions include incorporation of multi-modal feedback channels, adaptive user modeling, richer explainability mechanisms, and extending collaborative taxonomy construction to dynamic, multi-agent, and real-time learning settings (Li, 16 May 2024).
References
- "Human-AI Collaborative Taxonomy Construction: A Case Study in Profession-Specific Writing Assistants" (Lee et al., 26 Jun 2024)
- "To Classify is to Interpret: Building Taxonomies from Heterogeneous Data through Human-AI Collaboration" (Meier et al., 2023)
- "Tracing and Visualizing Human-ML/AI Collaborative Processes through Artifacts of Data Work" (and et al., 2023)
- "A Design Trajectory Map of Human-AI Collaborative Reinforcement Learning Systems: Survey and Taxonomy" (Li, 16 May 2024)
- "Deconstructing Human-AI Collaboration: Agency, Interaction, and Adaptation" (Holter et al., 18 Apr 2024)
- "How Developers Interact with AI: A Taxonomy of Human-AI Collaboration in Software Engineering" (Treude et al., 15 Jan 2025)
- "Architecting Human-AI Cocreation for Technical Services -- Interaction Modes and Contingency Factors" (Wulf et al., 18 Jul 2025)
- "The future of human-AI collaboration: a taxonomy of design knowledge for hybrid intelligence systems" (Dellermann et al., 2021)