QRMine: Python CLI for Grounded Theory Research
- QRMine is an open-source Python package that automates Grounded Theory coding and triangulation with integrated NLP and ML tools.
- It offers both a Click-based CLI and a Python API to process text and CSV data, facilitating iterative qualitative and quantitative analysis.
- The package supports extensible pipelines including topic modeling (LDA), sentiment analysis, and clustering to provide actionable insights for GT research.
QRMine is an open-source Python package providing command-line and module-based utilities for computational triangulation and the systematic coding of qualitative and quantitative data in Grounded Theory (GT) research. It integrates NLP and ML techniques to automate and accelerate core GT stages such as open coding, axial coding, selective coding, and triangulation, facilitating the corroboration of qualitative concepts with quantitative evidence. QRMine is available through the Python Package Index (PyPI) and supports extensible pipelines via both a Click-based CLI and a Python API, leveraging libraries such as spaCy, gensim, scikit-learn, and TensorFlow for text and numeric data analysis (Eapen et al., 2020).
1. Installation, Setup, and System Requirements
QRMine is designed for Python 3.6+ environments and is distributed via PyPI. Installation begins in a clean virtual environment, created via venv or conda, followed by pip install qrmine. The package requires spaCy and its small English model (en_core_web_sm) for linguistic analysis. Optional dependencies for neural-network tasks are TensorFlow (CPU or GPU build) and Keras; the GPU build enables GPU-accelerated computation. The CLI script becomes available on the path after installation, and module APIs can be imported into Python or Jupyter workflows.
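Sample setup instructions: create and activate a virtual environment, run pip install qrmine, then download the spaCy model with python -m spacy download en_core_web_sm.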
Optional: pip install tensorflow keras for accelerated neural network tasks.
2. Command-Line and Module Interfaces
All QRMine operations are invoked via the qrmine base command, supplemented with subcommands and flags for distinct GT and ML workflows. Inputs can be plain text (transcripts), CSV (numeric data, identifiers, dependent variables), or combinations thereof. Filtering by topic, sentiment, or specific transcript sections is supported. The CLI supports output to both STDOUT and user-defined files, and most commands accept configurable limits (e.g., top-N categories or number of topics).
The Python module exposes three principal classes:
- ReadData: for ingesting and parsing text and CSV data.
- Qrmine: encapsulating NLP-based coding and topic modeling.
- MLQRMine: providing wrappers for numeric machine learning algorithms (see Section 3).
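Methods referenced in Section 4 include get_categories (open coding), build_codedict (axial coding), and topic_model (selective coding).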
Major CLI flags and commands are summarized below:
| Flag / Command | Functionality | Default/Example Value |
|---|---|---|
| -i, --input | Input transcript(s) (txt/CSV) | "transcript.txt" |
| --csv | Input numeric (CSV) data | "data.csv" |
| --cat | Top N repeating verbs (open coding) | -n 10 (overridable) |
| --codedict | Axial coding dictionary | -n 10 (overridable) |
| --topics | LDA topic modeling | -n 3 (overridable) |
| --assign | Assign docs to topics | |
| --sentiment | VADER sentiment analysis | --sentence (optional) |
| --nnet, --svm, --kmeans, --knn, --pca | Numeric ML tasks (see Section 3) | |
3. Methodological and Algorithmic Foundations
QRMine operationalizes standard and contemporary ML and NLP methodologies for GT:
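- Textual Coding: Open coding extracts the $N$ most frequent verbs via spaCy lemmatization; axial coding generates dictionaries linking verbs to adjacent adjectives/adverbs based on syntactic dependency parse outputs.
- Topic Modeling: Implements Latent Dirichlet Allocation (LDA) using a term–document matrix (TF-IDF weighted). LDA is optimized via collapsed Gibbs sampling or variational EM to derive topic–word ($\phi_k$) and document–topic ($\theta_d$) distributions. Topic assignment assigns each document $d$ to topic $\hat{k}_d = \arg\max_k \theta_{d,k}$.
- Numeric ML:
  - Neural network classifier fits cross-entropy loss: $\mathcal{L} = -\sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$.
  - SVM solves: $\min_{w,b} \tfrac{1}{2}\lVert w \rVert^2$ subject to $y_i (w^\top x_i + b) \ge 1$ for all $i$.
  - K-means: $\min_{\mu_1,\dots,\mu_K} \sum_{j=1}^{K} \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^2$ for clusters $S_1, \dots, S_K$.
  - PCA: Eigen-decompose covariance $\Sigma = \tfrac{1}{n-1} X^\top X$ (with $X$ column-centered), retain principal components.
  - k-NN: Retrieve $k$ nearest neighbours by Euclidean/cosine distance.
- Computational Triangulation: After generating topic assignments (a vector $u$ derived from the document–topic distributions) and numeric structures (a vector $v$ derived from clusters or principal components), similarity is measured, e.g., via cosine similarity: $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$, thus corroborating qualitative themes with quantitative clusters.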
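The topic-assignment and triangulation steps can be illustrated with a small sketch. It is not QRMine's implementation: the function names, the toy document-topic weights, and the cluster labels are all hypothetical. Because topic indices and cluster indices are arbitrary labels, the sketch assumes one possible comparison, namely the cosine similarity of pairwise co-membership vectors (1 when two records share a group, 0 otherwise).

```python
import math
from itertools import combinations


def assign_topic(theta_d):
    """Return the index of the largest weight in one document-topic vector."""
    return max(range(len(theta_d)), key=lambda k: theta_d[k])


def co_membership(labels):
    """One entry per pair (i, j) with i < j: 1 if both share a label, else 0."""
    return [1 if labels[i] == labels[j] else 0
            for i, j in combinations(range(len(labels)), 2)]


def cosine_similarity(u, v):
    """Cosine similarity u.v / (|u| |v|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical document-topic weights: four documents, three topics.
theta = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
# Hypothetical cluster labels for the same four participants (e.g. from k-means).
cluster_labels = [1, 0, 1, 1]

topic_labels = [assign_topic(t) for t in theta]
score = cosine_similarity(co_membership(topic_labels),
                          co_membership(cluster_labels))
print(topic_labels, round(score, 3))
```

A score near 1 would suggest that documents grouped together by topic also tend to fall into the same numeric cluster; a score near 0 would suggest little overlap between the two groupings.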
4. Principal Features, Data Flows, and GT Stage Mapping
QRMine supports the full GT pipeline:
- Open Coding: (--cat, get_categories) yields frequent verb/concept lists.
- Axial Coding: (--codedict, build_codedict) maps concepts to descriptors.
- Selective Coding: (--topics, topic_model) uncovers latent themes and core categories.
- Triangulation: Integration of topic vectors and numeric cluster memberships for corroborative analysis.
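The logic behind the open and axial coding stages can be sketched without any NLP library. The example below is an illustration under stated assumptions, not QRMine's code: the tokens are hypothetical and already lemmatized and POS-tagged, the function names are invented for this sketch, and a simple one-token window stands in for the dependency-parse relations that the package is described as using.

```python
from collections import Counter, defaultdict

# Hypothetical input: sentences as lists of (lemma, coarse POS tag) pairs.
sentences = [
    [("patient", "NOUN"), ("quickly", "ADV"), ("report", "VERB"),
     ("severe", "ADJ"), ("pain", "NOUN")],
    [("nurse", "NOUN"), ("report", "VERB"), ("late", "ADV")],
    [("patient", "NOUN"), ("feel", "VERB"), ("anxious", "ADJ")],
    [("family", "NOUN"), ("report", "VERB"), ("concern", "NOUN")],
    [("patient", "NOUN"), ("often", "ADV"), ("feel", "VERB"),
     ("isolated", "ADJ")],
    [("staff", "NOUN"), ("avoid", "VERB"), ("conflict", "NOUN")],
]


def top_verbs(sents, n):
    """Open coding sketch: the n most frequent verb lemmas with their counts."""
    counts = Counter(lemma for sent in sents
                     for lemma, pos in sent if pos == "VERB")
    return counts.most_common(n)


def build_code_dictionary(sents, verbs, window=1):
    """Axial coding sketch: map each verb to adjectives/adverbs found within
    `window` tokens of it in the same sentence (a stand-in for parse relations)."""
    codes = defaultdict(set)
    for sent in sents:
        for i, (lemma, pos) in enumerate(sent):
            if pos != "VERB" or lemma not in verbs:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for other_lemma, other_pos in sent[lo:hi]:
                if other_pos in {"ADJ", "ADV"}:
                    codes[lemma].add(other_lemma)
    return {verb: sorted(mods) for verb, mods in codes.items()}


categories = top_verbs(sentences, 2)
code_dict = build_code_dictionary(sentences, {verb for verb, _ in categories})
print(categories)
print(code_dict)
```

The category list plays the role of the open-coding output, and the dictionary plays the role of the axial-coding codebook that a researcher would then review and refine.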
Data formats are standardized: plain text (optionally tagged with <break>TITLE</break> per section/interview) for transcripts, and CSV for numeric/categorical data (first column: identifier, last: dependent variable, intermediates: features). Outputs are plain-text tables, JSON dictionaries, or structured Python objects, facilitating downstream quantitative-qualitative integration.
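A minimal sketch of how these two input layouts could be parsed with the standard library follows. It does not use QRMine's ReadData class; the function names and sample data are hypothetical. It further assumes that each <break>TITLE</break> tag closes the section that precedes it, that the CSV has a header row, and that all feature columns are numeric.

```python
import csv
import io
import re

BREAK_TAG = re.compile(r"<break>(.*?)</break>")


def split_transcript(text):
    """Return (title, body) pairs, assuming each tag follows its section."""
    # With one capture group, re.split yields [body, title, body, title, ..., tail].
    parts = BREAK_TAG.split(text)
    sections = []
    for i in range(0, len(parts) - 1, 2):
        sections.append((parts[i + 1].strip(), parts[i].strip()))
    return sections  # text after the last tag (the tail) is ignored


def split_csv_columns(csv_text):
    """Split rows into identifiers (first column), numeric features
    (middle columns), and the dependent variable (last column)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]  # assumes a header row is present
    ids = [row[0] for row in data]
    features = [[float(x) for x in row[1:-1]] for row in data]
    target = [row[-1] for row in data]
    return header, ids, features, target


sample_text = (
    "I felt supported by the nurses.\n<break>Interview 1</break>\n"
    "The waiting time was too long.\n<break>Interview 2</break>\n"
)
sample_csv = "id,age,visits,outcome\nP1,34,2,1\nP2,51,5,0\n"

sections = split_transcript(sample_text)
header, ids, features, target = split_csv_columns(sample_csv)
print(sections)
print(ids, features, target)
```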
A prototypical GT workflow involves:
- Automatic category extraction and codebook generation.
- Topic modeling and assignment.
- Numeric feature reduction (e.g., PCA) or cluster discovery.
- Quantitative–qualitative alignment via vector similarity or clustering.
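The final alignment step can also be approximated by cross-tabulating topic assignments against cluster memberships. The sketch below is one assumed way to do this, not a QRMine routine: the labels are hypothetical, the function names are invented, and purity (the share of records that carry the majority topic of their cluster) is used as a simple agreement measure.

```python
from collections import Counter


def cross_tabulate(topic_labels, cluster_labels):
    """Count how many records fall into each (topic, cluster) combination."""
    if len(topic_labels) != len(cluster_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(topic_labels, cluster_labels))


def cluster_purity(topic_labels, cluster_labels):
    """Fraction of records whose topic is the majority topic of their cluster."""
    if not topic_labels:
        raise ValueError("at least one record is required")
    table = cross_tabulate(topic_labels, cluster_labels)
    best_per_cluster = {}
    for (_topic, cluster), count in table.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), count)
    return sum(best_per_cluster.values()) / len(topic_labels)


# Hypothetical labels for six participants.
topic_labels = [0, 0, 1, 1, 2, 2]    # e.g. from topic assignment
cluster_labels = [0, 0, 1, 1, 1, 2]  # e.g. from k-means on the CSV features

table = cross_tabulate(topic_labels, cluster_labels)
purity = cluster_purity(topic_labels, cluster_labels)
print(dict(table), round(purity, 3))
```

High purity would indicate that the numeric clusters largely coincide with the qualitative themes; cells with mixed topics point to participants worth revisiting in the transcripts.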
5. Technical Architecture, Dependencies, and Extensibility
QRMine exhibits a modular, layered architecture with three core modules: ReadData for I/O, Qrmine for NLP, and MLQRMine for numeric ML. The package directory structure is as follows:
- read_data.py: Data parsing routines.
- qrmine.py: NLP logic and algorithmic wrappers.
- ml_qrmine.py: ML abstractions.
- cli.py: Command registration (Click decorators).
Major dependencies:
| Functionality | Libraries |
|---|---|
| NLP preprocessing | spaCy, textacy |
| Sentiment analysis | vaderSentiment |
| Topic modeling | gensim, scikit-learn |
| Numeric ML | scikit-learn, Keras, TensorFlow, imbalanced-learn, mlxtend |
| CLI | Click |
Design patterns such as facade/wrapper abstractions allow QRMine to multiplex over multiple libraries, and modular separation ensures clarity and maintainability. Extensibility is achieved by subclassing Qrmine (to swap NLP/ML methods) or by introducing new CLI commands via Click.
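The subclassing route can be pictured with a self-contained sketch. StandInCoder below is a hypothetical stand-in written for this illustration and is not QRMine's Qrmine class; only the method name get_categories is borrowed from the text, and its signature here is an assumption.

```python
from collections import Counter


class StandInCoder:
    """Hypothetical stand-in for an NLP coding class (not QRMine's Qrmine)."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def get_categories(self, n=10):
        """Return the n most frequent tokens with their counts."""
        return Counter(self.tokens).most_common(n)


class StopwordFilteredCoder(StandInCoder):
    """Subclass that swaps in a different category-extraction rule."""

    STOPWORDS = frozenset({"the", "a", "and"})

    def get_categories(self, n=10):
        """Same interface as the parent, but ignore stopwords before counting."""
        kept = [t for t in self.tokens if t.lower() not in self.STOPWORDS]
        return Counter(kept).most_common(n)


tokens = ["the", "care", "the", "care", "trust", "the"]
base_top = StandInCoder(tokens).get_categories(1)
filtered_top = StopwordFilteredCoder(tokens).get_categories(2)
print(base_top, filtered_top)
```

Because the subclass keeps the parent's method name and return shape, calling code (such as a CLI command) would not need to change when the extraction rule is replaced.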
6. Integration, Best Practices, and Contribution Guidelines
The recommended integration route for GT researchers is iterative: begin with open coding in QRMine to establish a core concept list; then perform manual review and refinement in external qualitative software as needed. Selective coding via topic modeling assists with theoretical sampling and prioritization of additional data collection. Numeric ML pipelines serve as a computational triangulation stage, allowing GT researchers to cross-validate emergent concepts with quantitative clusters or principal components.
Version control and collaboration are facilitated via the official repository at https://github.com/dermatologist/nlp-qrmine. Contributions proceed by forking, branching, adhering to dev requirements (pytest, black, mypy), testing, and submitting pull requests. Extensibility points in the API and CLI facilitate adaptation to evolving ML and NLP standards.
A plausible implication is that QRMine operationalizes and accelerates the iterative, data-driven nature of GT, aligning the method with the analytical challenges and opportunities posed by contemporary big data research environments (Eapen et al., 2020).