Automatic Issue-Type Classification
- Automatic issue-type classification is a process that uses machine learning and statistical methods to label software issues with categories such as bug, enhancement, and question.
- Techniques range from bag-of-n-grams and TF–IDF features to transformer-based models such as BERT and GPT-3.5-turbo for extracting and interpreting textual features.
- Practical implementations integrate with software workflows, automating issue triage, improving developer-task matching, and streamlining project management.
Automatic issue-type classification refers to the use of statistical and machine learning techniques to assign semantic or functional categories—such as bug, feature/enhancement, question, or required skill domain—to software issue tracker entries based solely on their textual content or associated metadata. This process addresses the workflow bottleneck introduced by high issue throughput in large software repositories and facilitates improved triage, developer assignment, and project management.
1. Taxonomies of Issue Types
Contemporary approaches employ two principal taxonomies for issue classification:
- Intent-based categories: These are canonical in mainstream GitHub workflows and include bug report, enhancement (feature request), and question. These are typically mutually exclusive and represent the default labeling taxonomy for classification tasks, as operationalized in "Ticket Tagger" and related studies (Kallis et al., 2021, Trautsch et al., 2022, Aracena et al., 2024).
- API-domain/skill-based categories: Used in tools such as "GiveMeLabeledIssues," this schema comprises 31 project- and skill-relevant domains (e.g., UI, DB, Test, Networking, Security) and is designed to support developer-task matching beyond the intent dimension, serving as a proxy for the APIs and skills required to resolve the issue. These labels are not mutually exclusive, yielding a multi-label classification setting (Vargovich et al., 2023).
2. Datasets and Preprocessing Protocols
Public datasets typically draw from closed and labeled GitHub issues, with selection and preprocessing adapted to task goals:
- Balanced vs. Unbalanced Sampling: Balanced datasets contain equal numbers of issues per target class to minimize model bias, e.g., 10,000 bugs, enhancements, and questions each (12,112 repositories; Feb 2018) (Kallis et al., 2021). Unbalanced datasets reflect real-world submission frequencies (e.g., 16,355 bugs, 14,228 enhancements, 3,458 questions) (Kallis et al., 2021).
- Text Normalization:
- Title and body concatenation.
- Basic tokenization (white-space, punctuation splitting).
- For skill/domain labeling, additional normalization includes lowercasing, removal of URLs, in-line code, numbers, punctuation, and application of stemming and stop-word removal (Vargovich et al., 2023).
- In LLM pipelines, further cleaning replaces user handles, URLs, and HTML with sentinel tokens (Aracena et al., 2024).
- Sampling Restrictions: only closed issues carrying a single relevant label are retained (to avoid ambiguous cases and post-facto re-labelings).
- API Linking for Skill Domain: Only issues resolved via pull requests (PRs) modifying ≥1 source file are considered when skill-based labeling is required (Vargovich et al., 2023). API usage is extracted via static parsing of import/include/using statements.
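The normalization steps above can be sketched as a small pipeline. The stop-word set and the suffix-stripping rule here are illustrative stand-ins for the full stop-word lists and stemmers (e.g., Porter) the cited pipelines use:

```python
import re

# Illustrative stop-word list -- a stand-in for a full NLTK-style list.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "at", "when"}

def normalize_issue(title: str, body: str) -> list[str]:
    """Sketch of the preprocessing protocol: concatenate title and body,
    lowercase, strip inline code / URLs / numbers / punctuation,
    then tokenize and drop stop words."""
    text = f"{title} {body}".lower()
    text = re.sub(r"`[^`]*`", " ", text)       # inline code spans
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # numbers and punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Naive suffix stripping as a stand-in for a real stemmer.
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]

tokens = normalize_issue(
    "Crash when saving",
    "App crashes at https://example.com `save()` 3 times",
)
```

The order of operations matters: code spans and URLs are removed before the punctuation filter, otherwise URL fragments would survive as spurious tokens.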
3. Feature Engineering Methods
Approaches span a spectrum from low-level n-gram encodings to contextual representation learning:
- Bag-of-n-grams: Classic approaches use character-level (3–6-gram) representations as in fastText, where each word is split into overlapping n-grams and the resulting embeddings are averaged into a fixed-size vector (Kallis et al., 2021).
- TF–IDF over Processed Corpora: For API-domain labeling, the TF–IDF vectorization (with preprocessing) is standard; token frequency is counterbalanced by inverse document frequency (Vargovich et al., 2023).
- Contextual Embeddings: Transformer models (BERT, seBERT, GPT-3.5-turbo) tokenize text into WordPieces, with document-level semantics pooled from the special [CLS] token (Trautsch et al., 2022, Aracena et al., 2024).
- Exclusions: Metadata such as timestamps and user roles are not included in most intent-type classification models, though future work proposes their integration (Kallis et al., 2021).
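The fastText-style subword features from the first bullet can be made concrete. A word is padded with boundary markers and decomposed into overlapping character n-grams; fastText then hashes each n-gram into a fixed number of embedding buckets (we use Python's built-in `hash` as a stand-in for fastText's FNV hashing):

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """fastText-style subword features: pad the word with boundary
    markers, then emit all character n-grams of length n_min..n_max."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]

def bucket_ids(word: str, buckets: int = 2_000_000) -> set[int]:
    """Hash n-grams into embedding buckets; the bucket embeddings are
    averaged into a fixed-size issue vector by the model."""
    return {hash(g) % buckets for g in char_ngrams(word)}

# "bug" padded to "<bug>" yields six n-grams of lengths 3-5.
print(char_ngrams("bug"))
```

The boundary markers let the model distinguish prefixes and suffixes (`<bu` vs. a word-internal `bu`), which is what makes subword features robust to typos and rare identifiers in issue text.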
4. Classification Models and Training Procedures
Automated issue-type labeling has been implemented using a range of discriminative models:
| Approach | Feature Set | Model Family | Output Type |
|---|---|---|---|
| fastText (Ticket Tagger) | char-n-grams | Linear classifier (hierarchical softmax) | Single label |
| J48 (C4.5) | TF–IDF tokens | Decision Tree | Single label |
| Random Forest (GiveMeLabeledIssues) | TF–IDF tokens + API usage | RF (multi-label) | Multi-label |
| BERT / seBERT | WordPiece tokens | Transformer | Single/Multi |
| GPT-3.5-turbo | Prompted text | LLM + fine-tuning | Single label |
- Ticket Tagger: fastText with hierarchical softmax; word-level n-grams are disabled for efficiency; feature pruning by frequency (minCount=14) (Kallis et al., 2021). Training is via negative log-likelihood loss over classes.
- GiveMeLabeledIssues: Multi-label Random Forest with optimized hyperparameters (entropy criterion, max_depth=50, n=50 estimators) per project. Alternative BERT-based models are also implemented via fast-bert, with a sigmoid-activated dense head for multi-label output (Vargovich et al., 2023).
- seBERT: Transformer encoder (BERTLARGE, 24 layers, 1024 hidden units), pre-trained from scratch on SE-related corpora, then fine-tuned for issue classification with a softmax layer over three labels and multi-class cross-entropy loss (Trautsch et al., 2022).
- GPT-3.5-turbo: Instruction-prompted, few-shot fine-tuned LLM with temperature=0.0 and max_tokens=1 to yield deterministic, single-token classifications based on a prompt encompassing both title and body. Fine-tuned using 300 labeled issues per repository, with epochs tailored to dataset and convergence (Aracena et al., 2024).
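The GPT-3.5-turbo configuration above can be sketched as a chat-completion request payload. The system-prompt wording and label set here are illustrative, not the exact prompt from the cited study; the `temperature` and `max_tokens` settings are those reported:

```python
LABELS = ["bug", "enhancement", "question"]

def build_request(title: str, body: str) -> dict:
    """Assemble an OpenAI chat-completion payload for single-token
    issue classification (hypothetical prompt wording)."""
    return {
        "model": "gpt-3.5-turbo",
        "temperature": 0.0,  # deterministic decoding
        "max_tokens": 1,     # force a single-token label
        "messages": [
            {"role": "system",
             "content": "Classify the GitHub issue as one of: "
                        + ", ".join(LABELS) + ". Answer with one word."},
            {"role": "user",
             "content": f"Title: {title}\n\nBody: {body}"},
        ],
    }

req = build_request("App crashes on save", "Steps to reproduce: ...")
```

Capping the completion at one token only works if every label tokenizes to a single token in the model's vocabulary, which is one reason short canonical labels are used.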
5. Evaluation Metrics and Benchmark Results
Standardized metrics are used to quantify model effectiveness:
- Precision ($P_c$), Recall ($R_c$), and F1-score ($F1_c$) for each class $c$:

  $$P_c = \frac{TP_c}{TP_c + FP_c}, \qquad R_c = \frac{TP_c}{TP_c + FN_c}, \qquad F1_c = \frac{2\,P_c\,R_c}{P_c + R_c}$$
- Macro/micro-averaged F1: macro-averaging takes the unweighted mean of per-class F1, while micro-averaging pools true/false positives and negatives over all classes; Hamming loss is used for multi-label skill-domain settings (Vargovich et al., 2023).
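These metrics can be computed from scratch to make the definitions explicit; this sketch implements per-class F1, the macro and micro averages, and Hamming loss for the multi-label case:

```python
def f1_report(y_true: list[str], y_pred: list[str], labels: list[str]) -> dict:
    """Per-class F1 plus macro- and micro-averaged F1."""
    per_class = {}
    tp_all = fp_all = fn_all = 0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        tp_all += tp; fp_all += fp; fn_all += fn
    macro = sum(per_class.values()) / len(labels)
    mp = tp_all / (tp_all + fp_all) if tp_all + fp_all else 0.0
    mr = tp_all / (tp_all + fn_all) if tp_all + fn_all else 0.0
    micro = 2 * mp * mr / (mp + mr) if mp + mr else 0.0
    return {"per_class": per_class, "macro_f1": macro, "micro_f1": micro}

def hamming_loss(y_true: list[set], y_pred: list[set], labels: list[str]) -> float:
    """Fraction of (issue, label) decisions that are wrong -- the
    multi-label metric used in the skill-domain setting."""
    wrong = sum((c in t) != (c in p)
                for t, p in zip(y_true, y_pred) for c in labels)
    return wrong / (len(y_true) * len(labels))
```

Micro-averaging weights each issue equally, so it tracks performance on frequent classes; macro-averaging weights each class equally, which is why the rare "question" class drags macro scores down on unbalanced data.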
Performance highlights from major approaches include:
| Model | Data/setting | F1-score (macro/micro) | Notable Per-Class Results |
|---|---|---|---|
| fastText (TT) | Balanced | 0.83 | Bug/Enhancement/Question: 0.83/0.82/0.83 |
| fastText (TT) | Unbalanced | — | Bug/Enhancement/Question F1: 0.75/0.74/0.48; “question” is notably harder |
| Random Forest (GMLI) | Per-project | 0.817 (macro) | Precision up to 0.872 (Audacity). |
| seBERT | Large-scale | 0.857 (micro) | Outperforms fastText by +4.1%; biggest gain for “question” class (+12.8%) (Trautsch et al., 2022). |
| GPT-3.5-turbo | Repo-level | ~0.83 (macro) | Repo-specific F1 to 0.87 (tensorflow), precision up to 0.93, recall up to 0.95 (Aracena et al., 2024) |
In all evaluated studies, transformer models (seBERT, GPT-based) outperform classical baselines (fastText, Random Forest) both in overall and per-class metrics, particularly in recall and F1 for the “question” label (Trautsch et al., 2022, Aracena et al., 2024).
6. Deployment and Integration Workflows
Deployed systems range from cloud microservices to custom bots:
- Ticket Tagger: Packaged as a Node.js GitHub App; on “issue opened” events, concatenates title/body and predicts label using a pre-trained fastText model, updating labels via the GitHub REST API. The system is optimized for rapid inference on low-cost instances (e.g., AWS t2.nano) with a lightweight (~5 MB) footprint (Kallis et al., 2021).
- GiveMeLabeledIssues: Packaged as a Django REST API, supporting nightly (cron) batch classification and SQLite caching for open issues. Model retraining and new project onboarding are automated via project registration scripts and TF–IDF+RandomForest/BERT pipelines (Vargovich et al., 2023).
- LLM/GPT approaches: Fine-tuned models are deployed behind prompt-based APIs; project adaptation involves labeled data collection, prompt engineering, fine-tuning per project, and direct RESTful inference (Aracena et al., 2024).
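The Ticket Tagger-style event flow can be sketched end to end: receive an "issue opened" webhook payload, classify the concatenated title and body, and prepare the GitHub REST call that applies the label. The `classify` stub here is a hypothetical placeholder for the trained fastText model:

```python
def classify(text: str) -> str:
    """Placeholder for the trained model -- a keyword heuristic
    standing in for fastText inference."""
    return "bug" if "crash" in text.lower() else "enhancement"

def handle_issue_opened(payload: dict) -> tuple[str, dict]:
    """Turn a GitHub 'issues opened' webhook payload into the REST call
    that applies the predicted label:
    POST /repos/{owner}/{repo}/issues/{number}/labels"""
    issue = payload["issue"]
    label = classify(f"{issue['title']} {issue.get('body') or ''}")
    repo = payload["repository"]["full_name"]
    url = f"https://api.github.com/repos/{repo}/issues/{issue['number']}/labels"
    return url, {"labels": [label]}

url, body = handle_issue_opened({
    "issue": {"number": 7, "title": "Crash on save", "body": None},
    "repository": {"full_name": "octocat/hello-world"},
})
```

Separating payload handling from model inference keeps the webhook handler fast, which matters for the low-cost instances the deployed system targets.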
7. Limitations, Threats to Validity, and Future Directions
Recognized concerns across studies include:
- Label space restriction: Most intent-type classifiers are limited to bug/enhancement/question; project-specific or fine-grained taxonomies require retraining and/or feature schema redesign (Kallis et al., 2021).
- Data distribution shift: Balanced training does not mirror production skew; model performance for rare classes (especially “question”) is lower and confusion is higher (Kallis et al., 2021, Trautsch et al., 2022).
- Generalizability and external validity: Datasets are typically sampled from specific timeframes or repositories, possibly limiting cross-project or temporal applicability (Kallis et al., 2021, Vargovich et al., 2023).
- Feature exclusion: Current best-performing models generally do not leverage non-textual (metadata, code structure) features, although such signals may improve future models (Kallis et al., 2021, Vargovich et al., 2023).
- Sequence length constraints: Models like seBERT are capped at 128 tokens; long-form issues may be incompletely encoded unless longer-span models (e.g., Big Bird) are employed (Trautsch et al., 2022).
- Few-shot adaptability: GPT-3.5-turbo enables effective project-specific classifiers with only 100–300 examples per class, suggesting robustness in low-data regimes, but no cross-repo generalization evaluation is reported (Aracena et al., 2024).
This suggests that further gains may come from integrating metadata, dynamically adjusting class thresholds, or using adaptive sequence windows. Model selection for a given deployment should match the project's label schema, available data volume, and required prediction granularity.
8. Extensions: Skill-Oriented and Multi-Label Classification
Recent advances have extended issue classification to skill and API domain assignment:
- API-Domain Labeling: GiveMeLabeledIssues classifies issues with respect to the APIs affected in their resolving PRs, resulting in a 31-category multi-label vector where each label reflects a skill required (e.g., “DB,” “Test”) (Vargovich et al., 2023). This responds to a need for better developer-task matching and onboarding.
- Hybrid Feature Sets: Integration of code-level (API usage) evidence with text-based features is crucial; code usage is parsed directly from PRs using language-specific patterns.
- Model Selection: Multi-label Random Forests and optionally BERT for sequence representation are the principal classifiers; macro-averaged F1 and Hamming Loss are the primary evaluation metrics.
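The static API-usage extraction described above can be sketched with a regex-based import scanner. The module-to-domain mapping here is illustrative; the real tool derives its 31 domains from each project's own API surface and supports multiple languages' import/include/using forms:

```python
import re

# Hypothetical mapping from imported modules to API-domain labels.
DOMAIN_MAP = {
    "sqlite3": "DB", "unittest": "Test", "socket": "Networking",
    "ssl": "Security", "tkinter": "UI",
}

# Matches Python-style `import x` and `from x import y` statements.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_][\w.]*)", re.MULTILINE)

def api_domains(source: str) -> set[str]:
    """Static sketch of API-usage extraction: scan a changed file's
    import statements and map top-level modules to domain labels."""
    modules = {m.split(".")[0] for m in IMPORT_RE.findall(source)}
    return {DOMAIN_MAP[m] for m in modules if m in DOMAIN_MAP}

domains = api_domains("import sqlite3\nfrom unittest import mock\nimport os\n")
```

Each issue's multi-label target is then the union of domains over all source files touched by its resolving pull request, which is why the labels are not mutually exclusive.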
The practical implication is that tools supporting both intent and skill-domain classification are increasingly relevant for open source communities seeking to optimize both maintenance and contributor engagement pipelines.